Information processing apparatus, information processing method, and program

ABSTRACT

An information processing apparatus acquires a plurality of pieces of sound information, sound collection device position information, and target subject position information. In addition, the information processing apparatus specifies a target sound of a region corresponding to a position of a target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information. Further, the information processing apparatus generates target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information in a case in which a virtual viewpoint video is generated.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation application of International Application No. PCT/JP2020/027696, filed Jul. 16, 2020, the disclosure of which is incorporated herein by reference in its entirety. Further, this application claims priority from Japanese Patent Application No. 2019-138236 filed Jul. 26, 2019, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

1. Technical Field

The technology of the present disclosure relates to an information processing apparatus, an information processing method, and a program.

2. Related Art

JP2018-019294A discloses an information processing system that processes an image and a sound corresponding to any viewpoint based on a plurality of image signals imaged by a plurality of imaging apparatuses and a plurality of sound collection signals collected at a plurality of sound collection points. The information processing system disclosed in JP2018-019294A comprises an acquisition unit that acquires a viewpoint position and a visual line direction with respect to an imaging target, a decision unit that decides, depending on the viewpoint position and the visual line direction, a listening point which is a reference for generating a sound signal corresponding to the image depending on the viewpoint position and the visual line direction, the image being based on the plurality of image signals, and a sound generation unit that generates the sound signal depending on the listening point based on the plurality of sound collection signals. In addition, here, the decision unit further decides a listening range which is a spatial range which is a reference for selecting the sound collection point of the sound collection signal used for generating the sound signal, and the sound generation unit generates the sound signal depending on the listening point and the listening range based on the plurality of sound collection signals.

SUMMARY

One embodiment according to the technology of the present disclosure provides an information processing apparatus, an information processing method, and a program which can contribute to listening to a sound emitted from a region corresponding to a position of a target subject indicated by a generated virtual viewpoint video.

A first aspect according to the technology of the present disclosure relates to an information processing apparatus including an acquisition unit that acquires a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices, sound collection device position information indicating a position of each of the plurality of sound collection devices, and target subject position information indicating a position of a target subject in an imaging region, a specifying unit that specifies a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the sound collection device position information and the target subject position information which are acquired by the acquisition unit, and a generation unit that generates target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the target sound specified by the specifying unit is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the target subject position information acquired by the acquisition unit, based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information, in a case in which a virtual viewpoint video is generated by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.
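
As a rough illustration of the first aspect, the following Python sketch treats the sound collection devices near the target subject position as the source of the target sound and mixes them at a higher gain than the remaining devices. It is only a sketch under assumed conventions: the function name, the 2 m radius, and the gain values are hypothetical and are not taken from the disclosure.

    import numpy as np

    def emphasize_target_sound(sound_signals, device_positions, target_position,
                               target_radius_m=2.0, target_gain=1.0, other_gain=0.2):
        # sound_signals:    (num_devices, num_samples) array of collected sound samples
        # device_positions: (num_devices, 2) array of device positions in field coordinates [m]
        # target_position:  (2,) array giving the target subject position [m]
        sound_signals = np.asarray(sound_signals, dtype=float)
        device_positions = np.asarray(device_positions, dtype=float)
        target_position = np.asarray(target_position, dtype=float)

        # Distance from each sound collection device to the target subject.
        distances = np.linalg.norm(device_positions - target_position, axis=1)

        # Devices within the target radius carry the target sound; the rest are suppressed.
        gains = np.where(distances <= target_radius_m, target_gain, other_gain)

        # The weighted mix is the "target subject emphasis sound".
        return (gains[:, None] * sound_signals).sum(axis=0) / gains.sum()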

A second aspect according to the technology of the present disclosure relates to the information processing apparatus according to the first aspect, in which the generation unit selectively executes a first generation process of generating the target subject emphasis sound information, and a second generation process of generating integration sound information indicating an integration sound obtained by integrating a plurality of the sounds obtained by the plurality of sound collection devices based on the sound information acquired by the acquisition unit.

A third aspect according to the technology of the present disclosure relates to the information processing apparatus according to the second aspect, in which the generation unit executes the first generation process in a case in which the angle of view indicated by the angle-of-view information is less than a reference angle of view, and executes the second generation process in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view.
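
The angle-of-view switching of the third aspect can be pictured with the short Python sketch below, which builds on the emphasize_target_sound sketch above. The 60-degree reference angle of view is a hypothetical threshold chosen for illustration; the disclosure does not specify a value.

    import numpy as np

    REFERENCE_ANGLE_OF_VIEW_DEG = 60.0  # hypothetical reference angle of view

    def generate_adjustment_sound(angle_of_view_deg, sound_signals, device_positions,
                                  target_position):
        if angle_of_view_deg < REFERENCE_ANGLE_OF_VIEW_DEG:
            # First generation process: the viewer is zoomed in, so emphasize the target sound.
            return emphasize_target_sound(sound_signals, device_positions, target_position)
        # Second generation process: integrate all collected sounds at equal weight.
        return np.asarray(sound_signals, dtype=float).mean(axis=0)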

A fourth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to third aspects, in which indication information for indicating a position of a target subject image showing the target subject in an imaging region image showing the imaging region is received by a reception unit in a state in which the imaging region image is displayed by a display device, and the acquisition unit acquires the target subject position information based on correspondence information indicating a correspondence between a position in the imaging region and a position in the imaging region image showing the imaging region, and the indication information received by the reception unit.
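
One conceivable form of the correspondence information of the fourth aspect is a planar homography between pixel coordinates of the imaging region image and field coordinates of the imaging region. The following Python sketch, with an assumed 3x3 homography matrix, maps an indicated pixel position to a position in the imaging region; it is an illustrative assumption rather than the method of the disclosure.

    import numpy as np

    def image_to_field_position(image_xy, homography):
        # homography: 3x3 matrix mapping homogeneous pixel coordinates of the imaging
        # region image to field coordinates (one possible "correspondence information").
        px, py = image_xy
        x, y, w = homography @ np.array([px, py, 1.0])
        return np.array([x / w, y / w])  # target subject position in field coordinates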

A fifth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to third aspects, in which an observation direction of a person who observes an imaging region image showing the imaging region is detected by a detection unit in a state in which the imaging region image is displayed by a display device, and the acquisition unit acquires the target subject position information based on correspondence information indicating a correspondence between a position in the imaging region and a position in the imaging region image showing the imaging region, and a detection result by the detection unit.

A sixth aspect according to the technology of the present disclosure relates to the information processing apparatus according to the fifth aspect, in which the detection unit includes an imaging element, and detects a visual line direction of the person as the observation direction based on an eye image obtained by imaging eyes of the person by the imaging element.

A seventh aspect according to the technology of the present disclosure relates to the information processing apparatus according to the fifth aspect, in which the display device is a head mounted display mounted on the person, and the detection unit is provided on the head mounted display.

An eighth aspect according to the technology of the present disclosure relates to the information processing apparatus according to the seventh aspect, in which a plurality of the head mounted displays are present, and the acquisition unit acquires the target subject position information based on the detection result by the detection unit provided on a specific head mounted display among the plurality of head mounted displays, and the correspondence information.

A ninth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the fifth to eighth aspects, in which the generation unit does not generate the target subject emphasis sound information in a case in which a frequency at which the observation direction changes per unit time is equal to or more than a predetermined frequency.

A tenth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the fifth to eighth aspects, further including an output unit that is able to output the target subject emphasis sound information generated by the generation unit, in which the output unit does not output the target subject emphasis sound information generated by the generation unit in a case in which a frequency at which the observation direction changes per unit time is equal to or more than a predetermined frequency.

An eleventh aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the fifth to eighth aspects, in which the generation unit generates comprehensive sound information indicating a comprehensive sound obtained by integrating a plurality of the sounds obtained by the plurality of sound collection devices, and intermediate sound information indicating an intermediate sound in which the target sound is emphasized more than the comprehensive sound and suppressed more than the target subject emphasis sound, and the information processing apparatus further includes an output unit that outputs the comprehensive sound information, the intermediate sound information, and the target subject emphasis sound information, which are generated by the generation unit, in order of the comprehensive sound information, the intermediate sound information, and the target subject emphasis sound information in a case in which a frequency at which the observation direction changes per unit time is equal to or more than a predetermined frequency.
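
A minimal sketch of the stepped output of the eleventh aspect is shown below, assuming a hypothetical threshold of two observation-direction changes per second; when the observation direction changes that often, the output steps from the comprehensive sound through the intermediate sound to the target subject emphasis sound instead of switching at once.

    def select_output_sequence(direction_changes_per_second, comprehensive_sound,
                               intermediate_sound, emphasized_sound,
                               frequency_threshold_hz=2.0):
        # frequency_threshold_hz is a hypothetical "predetermined frequency".
        if direction_changes_per_second >= frequency_threshold_hz:
            # Step gradually toward the target subject emphasis sound.
            return [comprehensive_sound, intermediate_sound, emphasized_sound]
        return [emphasized_sound]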

A twelfth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to eleventh aspects, in which the target subject emphasis sound information is information indicating a sound including the target subject emphasis sound and not including the sound emitted from the different position.

A thirteenth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to twelfth aspects, in which the specifying unit specifies a positional relationship between the position of the target subject and the plurality of sound collection devices by using the sound collection device position information and the target subject position information, which are acquired by the acquisition unit, and the sound indicated by each of the plurality of pieces of sound information is a sound adjusted to be smaller as the sound is positioned farther from the position of the target subject depending on the positional relationship specified by the specifying unit.
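
The distance-dependent adjustment of the thirteenth aspect (see also FIG. 29) could take the form sketched below in Python, where each device's sound is scaled down monotonically with its distance from the target subject; the 1/(1 + d/rolloff) curve and the 10 m roll-off constant are assumptions for illustration only.

    import numpy as np

    def attenuate_by_distance(sound_signals, device_positions, target_position, rolloff_m=10.0):
        sound_signals = np.asarray(sound_signals, dtype=float)
        distances = np.linalg.norm(np.asarray(device_positions, dtype=float)
                                   - np.asarray(target_position, dtype=float), axis=1)
        gains = 1.0 / (1.0 + distances / rolloff_m)  # gain decreases as distance grows
        return sound_signals * gains[:, None]        # per-device scaled sound signals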

A fourteenth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to thirteenth aspects, in which a virtual viewpoint target subject image showing the target subject included in the virtual viewpoint video is an image that is in focus more than images in a periphery of the virtual viewpoint target subject image in the virtual viewpoint video.

A fifteenth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to fourteenth aspects, in which the sound collection device position information is information indicating the position of the sound collection device fixed in the imaging region.

A sixteenth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to fourteenth aspects, in which at least one of the plurality of sound collection devices is attached to the target subject.

A seventeenth aspect according to the technology of the present disclosure relates to the information processing apparatus according to any one of the first to fourteenth aspects, in which the plurality of sound collection devices are attached to a plurality of objects including the target subject in the imaging region.

An eighteenth aspect according to the technology of the present disclosure relates to an information processing method including acquiring a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices, sound collection device position information indicating a position of each of the plurality of sound collection devices in an imaging region, and target subject position information indicating a position of a target subject in the imaging region, specifying a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information, and generating target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information, based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information, in a case in which a virtual viewpoint video is generated by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.

A nineteenth aspect according to the technology of the present disclosure relates to a program causing a computer to execute a process including acquiring a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices, sound collection device position information indicating a position of each of the plurality of sound collection devices in an imaging region, and target subject position information indicating a position of a target subject in the imaging region, specifying a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information, and generating target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information, based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information, in a case in which a virtual viewpoint video is generated by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.

BRIEF DESCRIPTION OF THE DRAWINGS

Exemplary embodiments of the technology of the disclosure will be described in detail based on the following figures, wherein:

FIG. 1 is a schematic perspective diagram showing an example of an external configuration of an information processing system according to an embodiment;

FIG. 2 is a schematic perspective diagram showing an example of an external configuration of an HMD provided in the information processing system according to the embodiment;

FIG. 3 is a conceptual diagram showing an example of a relationship between an information processing apparatus provided in the information processing system according to the embodiment and peripheral devices thereof;

FIG. 4A is a conceptual diagram showing a disposition example of a plurality of sound collection devices provided in the information processing system according to the embodiment;

FIG. 4B is a conceptual diagram showing a first modification example of the disposition of the plurality of sound collection devices provided in the information processing system according to the embodiment;

FIG. 4C is a conceptual diagram showing a second modification example of the disposition of the plurality of sound collection devices provided in the information processing system according to the embodiment;

FIG. 5 is a block diagram showing an example of a hardware configuration of an electric system of the information processing apparatus according to the embodiment;

FIG. 6 is a block diagram showing an example of a hardware configuration of an electric system of a smartphone according to the embodiment;

FIG. 7 is a block diagram showing an example of a hardware configuration of an electric system of the HMD according to the embodiment;

FIG. 8 is a block diagram showing an example of a hardware configuration of an electric system of the sound collection device according to the embodiment;

FIG. 9 is a block diagram showing an example of a main function of the information processing apparatus according to the embodiment;

FIG. 10 is a conceptual diagram showing an example of an aspect in which a viewpoint/visual line/angle-of-view indication is given to the smartphone according to the embodiment;

FIG. 11 is a conceptual diagram provided for describing an example of a process content of a video generation unit of the information processing apparatus according to the embodiment;

FIG. 12 is a state transition diagram showing an example of an aspect in a case in which a viewpoint position and a visual line direction of a virtual viewpoint video generated by the video generation unit of the information processing apparatus according to the embodiment are changed;

FIG. 13 is a state transition diagram showing an example of an aspect in which an angle of view of the virtual viewpoint video generated by the video generation unit of the information processing apparatus according to the embodiment is changed;

FIG. 14 is a conceptual diagram provided for describing examples of process contents in the information processing apparatus and the HMD according to the embodiment;

FIG. 15 is a conceptual diagram showing an example of process content in which the virtual viewpoint video that is in focus with respect to a target subject image is generated according to target subject position information by the video generation unit of the information processing apparatus according to the embodiment, and the generated virtual viewpoint video is displayed on the HMD;

FIG. 16 is a block diagram provided for describing an example of process content of an acquisition unit and a specifying unit of the information processing apparatus according to the embodiment;

FIG. 17 is a block diagram provided for describing examples of process contents of a sound collection device side information acquisition unit, a target subject position information acquisition unit, the specifying unit, and an adjustment sound information generation unit of the information processing apparatus according to the embodiment;

FIG. 18 is a conceptual diagram showing examples of process contents of a first generation process and a second generation process executed by the adjustment sound information generation unit of the information processing apparatus according to the embodiment;

FIG. 19 is a block diagram showing an example of output of target subject emphasis sound information and integration sound information generated by the adjustment sound information generation unit of the information processing apparatus according to the embodiment;

FIG. 20 is a flowchart showing an example of a flow of a video generation process according to the embodiment;

FIG. 21 is a flowchart showing an example of a flow of a sound generation process according to the embodiment;

FIG. 22 is a continuation of the flowchart shown in FIG. 21;

FIG. 23 is a conceptual diagram provided for describing an example of process contents in the HMD and the information processing apparatus in a case in which an observation direction of a viewer is frequently changed;

FIG. 24 is a flowchart showing a first modification example of the flow of the sound generation process according to the embodiment;

FIG. 25 is a block diagram showing a modification example of the second generation process executed by the adjustment sound information generation unit of the information processing apparatus according to the embodiment;

FIG. 26 is a flowchart showing a second modification example of the flow of the sound generation process according to the embodiment;

FIG. 27 is a conceptual diagram provided for describing an example of process contents in the HMD and the information processing apparatus in a case in which a plurality of the viewers mount the HMD;

FIG. 28 is a block diagram showing an example of a configuration of the sound collection device attached to a target subject;

FIG. 29 is a graph showing an example of a correlation between a distance from a target subject position to the sound collection device and a volume of a sound indicated by the sound information;

FIG. 30 is a conceptual diagram showing an example of an aspect in which a field of view from the viewpoint position surrounds a reference region;

FIG. 31 is a conceptual diagram showing an example of an aspect in which the field of view from the viewpoint position is within the reference region;

FIG. 32 is a conceptual diagram showing an example of an aspect in which the field of view from the viewpoint position is out of the reference region;

FIG. 33 is a block diagram showing a modification example of the configuration of the HMD according to the embodiment; and

FIG. 34 is a block diagram showing an example of an aspect in which an information processing apparatus program according to the embodiment, stored in a storage medium, is installed in a computer of the information processing apparatus.

DETAILED DESCRIPTION

An example of an embodiment according to the technology of the present disclosure will be described with reference to the accompanying drawings.

First, the terms used in the following description will be described.

CPU refers to an abbreviation of “central processing unit”. RAM refers to an abbreviation of “random access memory”. DRAM refers to an abbreviation of “dynamic random access memory”. SRAM refers to an abbreviation of “static random access memory”. ROM refers to an abbreviation of “read only memory”. SSD refers to an abbreviation of “solid state drive”. HDD refers to an abbreviation of “hard disk drive”. EEPROM refers to an abbreviation of “electrically erasable and programmable read only memory”. I/F refers to an abbreviation of “interface”. IC refers to an abbreviation of “integrated circuit”. ASIC refers to an abbreviation of “application specific integrated circuit”. PLD refers to an abbreviation of “programmable logic device”. FPGA refers to an abbreviation of “field-programmable gate array”. SoC refers to an abbreviation of “system-on-a-chip”. CMOS refers to an abbreviation of “complementary metal oxide semiconductor”. CCD refers to an abbreviation of “charge coupled device”. EL refers to an abbreviation of “electro-luminescence”. GPU refers to an abbreviation of “graphics processing unit”. LAN refers to an abbreviation of “local area network”. 3D refers to an abbreviation of “3 dimension”. USB refers to an abbreviation of “universal serial bus”. HMD refers to an abbreviation of “head mounted display”. fps refers to an abbreviation of “frame per second”. GPS refers to an abbreviation of “global positioning system”. In addition, in the description of the present specification, “same” means the same in the sense of including an error generally allowed in the technical field to which the technology of the present disclosure belongs, in addition to the exact same.

For example, as shown in FIG. 1, an information processing system 10 comprises an information processing apparatus 12, a smartphone 14, a plurality of imaging apparatuses 16, an imaging apparatus 18, a wireless communication base station (hereinafter, simply referred to as “base station”) 20, and an HMD 34. Note that the number of the base stations 20 is not limited to one, and a plurality of the base stations 20 may be present. Further, the communication standards used in the base station 20 include a wireless communication standard including a Long Term Evolution (LTE) standard and a wireless communication standard including a WiFi (802.11) standard and/or a Bluetooth (registered trademark) standard.

The imaging apparatuses 16 and 18 are devices for imaging having a CMOS image sensor, and each have an optical zoom function and/or a digital zoom function. Note that another type of image sensor, such as a CCD image sensor, may be adopted instead of the CMOS image sensor. Hereinafter, for convenience of description, in a case in which a distinction is not necessary, the imaging apparatus 18 and the plurality of imaging apparatuses 16 are referred to as “plurality of imaging apparatuses” without reference numeral.

The plurality of imaging apparatuses 16 are installed in a soccer stadium 22. Each of the plurality of imaging apparatuses 16 is disposed so as to surround a soccer field 24, and images a region including the soccer field 24 as an imaging region in a plurality of directions. Here, an aspect example is described in which each of the plurality of imaging apparatuses 16 is disposed so as to surround the soccer field 24. However, the technology of the present disclosure is not limited to this, and the disposition of the plurality of imaging apparatuses 16 is decided depending on a virtual viewpoint video to be generated. The plurality of imaging apparatuses 16 may be disposed so as to surround the whole soccer field 24, or the plurality of imaging apparatuses 16 may be disposed so as to surround a specific part thereof. The imaging apparatus 18 is installed in an unmanned aerial vehicle (for example, a multi rotorcraft type unmanned aerial vehicle), and images the region including the soccer field 24 as the imaging region in a bird's-eye view from the sky. The imaging region of the region including the soccer field 24 in a bird's-eye view from the sky refers to an imaging face on the soccer field 24 by the imaging apparatus 18.

The information processing apparatus 12 is installed in a control room 32. The plurality of imaging apparatuses 16 and the information processing apparatus 12 are connected to each other via a LAN cable 30, and the information processing apparatus 12 controls the plurality of imaging apparatuses 16 and acquires an image obtained by being imaged by each of the plurality of imaging apparatuses 16. Note that although the connection using a wired communication method by the LAN cable 30 is described as an example here, the technology of the present disclosure is not limited to this, and the connection using a wireless communication method may be used.

The base station 20 transmits and receives various pieces of information to and from the information processing apparatus 12, the smartphone 14, the HMD 34, and the unmanned aerial vehicle 27 via the wireless communication. That is, the information processing apparatus 12 is connected to the smartphone 14, the HMD 34, and the unmanned aerial vehicle 27 via the base station 20 in the wirelessly communicable manner. The information processing apparatus 12 controls the unmanned aerial vehicle 27 by wirelessly communicating with the unmanned aerial vehicle 27 via the base station 20, and acquires the image obtained by being imaged by the imaging apparatus 18 from the unmanned aerial vehicle 27.

The information processing apparatus 12 is a device corresponding to a server, and the smartphone 14 and the HMD 34 are devices corresponding to a client terminal with respect to the information processing apparatus 12. Note that, in the following, in a case in which a distinction is not necessary, the smartphone 14 and the HMD 34 are referred to as “terminal device” without reference numeral.

The information processing apparatus 12 and the terminal device wirelessly communicate with each other via the base station 20, so that the terminal device requests the information processing apparatus 12 to provide various services, and the information processing apparatus 12 provides the services to the terminal device in response to the request from the terminal device.

The information processing apparatus 12 acquires a plurality of the images from the plurality of imaging apparatuses, and transmits a video generated based on the acquired plurality of images to the terminal device via the base station 20.

In the example shown in FIG. 1, a viewer 28 owns the smartphone 14, and the HMD 34 is mounted on a head of the viewer 28. The video transmitted from the information processing apparatus 12 (hereinafter, also referred to as “distribution video”) is received by the terminal device, and the distribution video received by the terminal device is visually recognized by the viewer 28 through the terminal device. In the soccer stadium 22, spectator seats 26 are provided so as to surround the soccer field 24. The viewer 28 may visually recognize the distribution video at the spectator seat 26, or may visually recognize the distribution video at a place (for example, at home) other than the spectator seat 26, and a place in which the viewer 28 visually recognizes the distribution video may be any place as long as the wireless communication with the information processing apparatus 12 is possible. Note that the viewer 28 is an example of a “person” according to the technology of the present disclosure.

For example, as shown in FIG. 2, the HMD 34 comprises a body part 11A, a mounting part 13A, and a speaker 158. The HMD 34 is mounted on the viewer 28. In a case in which the HMD 34 is mounted on the viewer 28, the body part 11A is positioned from the forehead to the front of the viewer 28, and the mounting part 13A is positioned on the upper half of the head of the viewer 28. The speaker 158 is attached to the mounting part 13A and is positioned on the left side of the head of the viewer 28.

The mounting part 13A is a band-shaped member having a width of about several centimeters, and comprises an inner ring 13A1 and an outer ring 15A1. The inner ring 13A1 is formed in an annular shape and is fixed in a state of being closely attached to the upper half of the head of the viewer 28. The outer ring 15A1 is formed in a shape in which an occipital side of the viewer 28 is cut out. The outer ring 15A1 bends outward from an initial position or shrinks inward from a bent state toward the initial position depending on adjustment of a size of the inner ring 13A1.

The body part 11A comprises a protective frame 11A1, a computer 150, and a display 156. The computer 150 controls the whole HMD 34. The protective frame 11A1 is one transparent plate curved so as to cover both eyes of the viewer 28 entirely, and is made of, for example, plastic having light transmittance.

The display 156 comprises a screen 156A and a projection unit 156B, and the projection unit 156B is controlled by the computer 150. The screen 156A is disposed inside the protective frame 11A1. The screen 156A is assigned to each of both eyes of the viewer 28. The screen 156A is made of a transparent material similar to the protective frame 11A1. The viewer 28 visually recognizes a real space via the screen 156A and the protective frame 11A1 with the naked eye. That is, the HMD 34 is a transmission type HMD.

The screen 156A is positioned at a position facing the eyes of the viewer 28, and the distribution video is projected on an inner surface of the screen 156A (surface on the viewer 28 side) by the projection unit 156B under the control of the computer 150. Since the projection unit 156B is a well-known device, the detailed description thereof will be omitted. However, the projection unit 156B is a device including a display element, such as a liquid crystal, which displays the distribution video, and a projection optical system that projects the distribution video displayed on the display element toward the inner surface of the screen 156A. The screen 156A is realized by using a half mirror that reflects the distribution video projected by the projection unit 156B and transmits the light in the real space. The projection unit 156B projects the distribution video on the inner surface of the screen 156A at a predetermined frame rate (for example, 60 fps). The distribution video is reflected by the inner surface of the screen 156A and is incident on the eyes of the viewer 28. As a result, the viewer 28 visually recognizes the distribution video. Note that the half mirror has been described as an example of the screen 156A here, but the technology of the present disclosure is not limited to this, and the screen 156A itself may be used as the display element, such as the liquid crystal. In addition to the screen projection type HMD shown here, a retina projection type HMD that directly irradiates the retina of the eyes of the viewer 28 with a laser may be adopted.

The speaker 158 is connected to the computer 150 and outputs the sound under the control of the computer 150. That is, under the control of the computer 150, the speaker 158 receives an electric signal indicating the sound, converts the received electric signal into the sound, and outputs the converted sound, so that audible display of various pieces of information is realized. Here, the speaker 158 is integrated with the computer 150, but the sound may be output by a separate headphone (including earphones) connected to the computer 150 by wire or wirelessly.

For example, as shown in FIG. 3, the information processing apparatus 12 acquires a bird's-eye view video 46A showing the region including the soccer field 24 in a case of being observed from the sky from the unmanned aerial vehicle 27. The bird's-eye view video 46A is a moving image obtained by imaging the region including the soccer field 24 as the imaging region (hereinafter, also simply referred to as “imaging region”) in a bird's-eye view from the sky by the imaging apparatus 18 of the unmanned aerial vehicle 27. Note that, here, although a case in which the bird's-eye view video 46A is the moving image is described as an example, the bird's-eye view video 46A is not limited to this, and may be a still image showing the region including the soccer field 24 in a case of being observed from the sky.

The information processing apparatus 12 acquires an imaging video 46B showing the imaging region in a case of being observed from each position of the plurality of imaging apparatuses 16 from each of the plurality of imaging apparatuses 16. The imaging video 46B is a moving image obtained by imaging the imaging region by each of the plurality of imaging apparatuses 16 in the plurality of directions. Note that, here, although a case in which the imaging video 46B is the moving image is described as an example, the imaging video 46B is not limited to this, and may be a still image showing the imaging region in a case of being observed from each position of the plurality of imaging apparatuses 16.

The bird's-eye view video 46A and the imaging video 46B are videos obtained by imaging the region including the soccer field 24 in the plurality of directions different from each other, and are examples of “a plurality of images” according to the technology of the present disclosure.

The information processing apparatus 12 generates a virtual viewpoint video 46 by using the bird's-eye view video 46A and the imaging video 46B. The virtual viewpoint video 46 is a video showing the imaging region in a case in which the imaging region is observed from a viewpoint position and a visual line direction different from a viewpoint position and a visual line direction of each of the plurality of imaging apparatuses. In the example shown in FIG. 3, the virtual viewpoint video 46 refers to the virtual viewpoint video showing the imaging region in a case in which the imaging region is observed from a viewpoint position 42 and a visual line direction 44 in a spectator seat 26. Examples of the virtual viewpoint video 46 include a moving image using 3D polygons.

The moving image is described as an example of the virtual viewpoint video 46 here, but the technology of the present disclosure is not limited to this, and a still image using the 3D polygons may be used. Here, an aspect example is described in which the bird's-eye view video 46A obtained by being imaged by the imaging apparatus 18 is also used for generating the virtual viewpoint video 46, but the technology of the present disclosure is not limited to this. For example, the bird's-eye view video 46A need not be used for generating the virtual viewpoint video 46, and only a plurality of the imaging videos 46B obtained by being imaged by the plurality of imaging apparatuses 16 may be used for generating the virtual viewpoint video 46. That is, the virtual viewpoint video 46 may be generated only from the videos obtained by being imaged by the plurality of imaging apparatuses 16 without using the video obtained by the imaging apparatus 18 (for example, a multi rotorcraft type unmanned aerial vehicle). Note that in a case in which the video obtained from the imaging apparatus 18 (for example, a multi rotorcraft type unmanned aerial vehicle) is used, a more accurate virtual viewpoint video can be generated.

The information processing apparatus 12 selectively transmits the bird's-eye view video 46A, the imaging video 46B, and the virtual viewpoint video 46 as the distribution video to the terminal device.

For example, as shown in FIG. 4A, the information processing system 10 comprises a plurality of sound collection devices 100. The sound collection device 100 performs the sound collection. Here, collecting the sound refers to capturing the sound, that is, the sound collection. In addition, the sound collection device 100 transmits sound information indicating the captured sound, that is, the collected sound. The plurality of sound collection devices 100 are present in the imaging region, and the installation positions of the plurality of sound collection devices 100 are fixed in the imaging region. In the present embodiment, “presence” refers to, for example, presence in a state of being spaced in a regular disposition. Note that the meaning of “presence” in the technology of the present disclosure also includes the meaning of presence in a state of being scattered irregularly or regularly.

In addition, in the example shown in FIG. 4A, the plurality of sound collection devices 100 are scattered in the imaging region, but the plurality of sound collection devices 100 do not necessarily have to be scattered in the imaging region. For example, the plurality of sound collection devices 100 may be aligned without gaps. In addition, the plurality of sound collection devices 100 do not necessarily have to be present in the imaging region. For example, as shown in FIGS. 4B and 4C, the plurality of sound collection devices 100 may be present outside the imaging region and perform the sound collection in the imaging region by a microphone having high directivity. In the example shown in FIG. 4B, the sound is collected by the plurality of sound collection devices 100 that are present in the imaging region and the plurality of sound collection devices 100 that are present outside the imaging region. In addition, in the example shown in FIG. 4C, the sound collection devices 100 are not present in the imaging region, the plurality of sound collection devices 100 are present outside the imaging region, and the plurality of sound collection devices having directivity in the imaging region collect the sound in the imaging region.

In the example shown in FIG. 4A, the plurality of sound collection devices 100 are embedded in the soccer field 24 in a matrix. Specifically, the sound collection devices 100 are disposed at predetermined intervals (for example, at intervals of 5 meters) from one end to the other end of a side line and from one end to the other end of a goal line. In the example shown in FIG. 4A, 35 sound collection devices 100 are disposed in a matrix in the soccer field 24, but the number of the sound collection devices 100 is not limited to this, and need only be plural. In addition, the plurality of sound collection devices 100 do not need to be disposed in a matrix. For example, the plurality of sound collection devices 100 may be disposed concentrically, spirally, or the like, and need only be present in the soccer field 24.

The plurality of sound collection devices 100 are connected to the information processing apparatus 12 via the base station 20 in a wirelessly communicable manner. Each of the plurality of sound collection devices 100 exchanges various pieces of information with the information processing apparatus 12 by performing the wireless communication with the information processing apparatus 12 via the base station 20. For example, each of the plurality of sound collection devices 100 transmits the sound information to the information processing apparatus 12 in response to a request from the information processing apparatus 12. The information processing apparatus 12 generates adjustment sound information based on a plurality of pieces of the sound information transmitted from the plurality of sound collection devices 100. The adjustment sound information is information indicating an adjustment sound obtained by adjusting at least a partial sound of the plurality of sounds indicated by the plurality of pieces of sound information. The information processing apparatus 12 transmits the generated adjustment sound information to the HMD 34. The HMD 34 receives the adjustment sound information transmitted from the information processing apparatus 12 and outputs the adjustment sound indicated by the received adjustment sound information from the speaker 158.

For example, as shown in FIG. 5, the information processing apparatus 12 comprises a computer 50, a reception device 52, a display 53, a first communication I/F 54, and a second communication I/F 56. The computer 50 comprises a CPU 58, a storage 60, and a memory 62, and the CPU 58, the storage 60, and the memory 62 are connected to each other via a bus line 64. In the example shown in FIG. 5, for convenience of illustration, one bus line is shown as the bus line 64, but a data bus, an address bus, a control bus, and the like are included in the bus line 64.

The CPU 58 controls the whole information processing apparatus 12. Various parameters and various programs are stored in the storage 60. The storage 60 is a non-volatile storage device. Here, a flash memory is adopted as an example of the storage 60, but the technology of the present disclosure is not limited to this, and an EEPROM, an HDD, an SSD, or the like may be used. The memory 62 is a volatile storage device. Various pieces of information are transitorily stored in the memory 62. The memory 62 is used as a work memory by the CPU 58. Here, a RAM is adopted as an example of the memory 62, but the technology of the present disclosure is not limited to this, and another type of volatile storage device may be used.

The reception device 52 receives the instruction from a user or the like of the information processing apparatus 12. Examples of the reception device 52 include a touch panel, a hard key, and a mouse. The reception device 52 is connected to the bus line 64, and the CPU 58 acquires the instruction received by the reception device 52. The display 53 is connected to the bus line 64 and displays various pieces of information under the control of the CPU 58. Examples of the display 53 include a liquid crystal display. Note that another type of display, such as an organic EL display or an inorganic EL display, may be adopted as the display 53 without being limited to the liquid crystal display.

The first communication I/F 54 is connected to the LAN cable 30. The first communication I/F 54 is realized by, for example, a device configured by circuits (for example, an ASIC, an FPGA, and/or a PLD). The first communication I/F 54 is connected to the bus line 64 and controls the exchange of various pieces of information between the CPU 58 and the plurality of imaging apparatuses 16. For example, the first communication I/F 54 controls the plurality of imaging apparatuses 16 in response to the request of the CPU 58. In addition, the first communication I/F 54 acquires the imaging video 46B (see FIG. 3) obtained by being imaged by each of the plurality of imaging apparatuses 16, and outputs the acquired imaging video 46B to the CPU 58.

The second communication I/F 56 is connected to the base station 20 in the wirelessly communicable manner. The second communication I/F 56 is realized by, for example, a device configured by circuits (for example, an ASIC, an FPGA, and/or a PLD). The second communication I/F 56 is connected to the bus line 64. The second communication I/F 56 controls the exchange of various pieces of information between the CPU 58 and the unmanned aerial vehicle 27 by the wireless communication method via the base station 20. In addition, the second communication I/F 56 controls the exchange of various pieces of information between the CPU 58 and the smartphone 14 by the wireless communication method via the base station 20. In addition, the second communication I/F 56 controls the exchange of various pieces of information between the CPU 58 and the HMD 34 by the wireless communication method via the base station 20. In addition, the second communication I/F 56 controls the exchange of various pieces of information between the CPU 58 and the plurality of sound collection devices 100 by the wireless communication method via the base station 20.

For example, as shown in FIG. 6, the smartphone 14 comprises a computer 70, a reception device 76, a display 78, a microphone 80, a speaker 82, an imaging apparatus 84, and a communication I/F 86. The computer 70 comprises a CPU 88, a storage 90, and a memory 92, and the CPU 88, the storage 90, and the memory 92 are connected to each other via a bus line 94. In the example shown in FIG. 6, for convenience of illustration, one bus line is shown as the bus line 94. However, the bus line 94 is configured by a serial bus or is configured to include a data bus, an address bus, a control bus, and the like. In addition, in the example shown in FIG. 6, the CPU 88, the reception device 76, the display 78, the microphone 80, the speaker 82, the imaging apparatus 84, and the communication I/F 86 are connected by a common bus, but the CPU 88 and each device may be connected by a dedicated bus or a dedicated communication line.

The CPU 88 controls the whole smartphone 14. Various parameters and various programs are stored in the storage 90. The storage 90 is a non-volatile storage device. Here, an EEPROM is adopted as an example of the storage 90, but the technology of the present disclosure is not limited to this, and a mask ROM, an HDD, an SSD, or the like may be used. Various pieces of information are transitorily stored in the memory 92, and the memory 92 is used as a work memory by the CPU 88. Here, a DRAM is adopted as an example of the memory 92, but the technology of the present disclosure is not limited to this, and another type of storage device, such as an SRAM, may be used.

The reception device 76 receives the instruction from the viewer 28. Examples of the reception device 76 include a touch panel 76A and a hard key. The reception device 76 is connected to the bus line 94, and the CPU 88 acquires the instruction received by the reception device 76.

The display 78 is connected to the bus line 94 and displays various pieces of information under the control of the CPU 88. Examples of the display 78 include a liquid crystal display. Note that another type of display, such as an organic EL display, may be adopted as the display 78 without being limited to the liquid crystal display.

The smartphone 14 comprises a touch panel display, and the touch panel display is realized by the touch panel 76A and the display 78. That is, the touch panel display is formed by superimposing the touch panel 76A on a display region of the display 78. In addition, in the present embodiment, the touch panel 76A is provided independently, but the touch panel 76A may be a so-called in-cell type touch panel built in the display 78.

The microphone 80 performs the sound collection (collects sound) and converts the collected sound into the electric signal. The microphone 80 is connected to the bus line 94. The CPU 88 acquires the electric signal obtained by converting the sound collected by the microphone 80 via the bus line 94.

The speaker 82 converts the electric signal into the sound. The speaker 82 is connected to the bus line 94. The speaker 82 receives the electric signal output from the CPU 88 via the bus line 94, converts the received electric signal into the sound, and outputs the sound obtained by converting the electric signal to the outside of the smartphone 14. The imaging apparatus 84 acquires an image showing a subject by imaging the subject. The imaging apparatus 84 is connected to the bus line 94. The image obtained by imaging the subject by the imaging apparatus 84 is acquired by the CPU 88 via the bus line 94.

The communication I/F 86 is connected to the base station 20 in the wirelessly communicable manner. The communication I/F 86 is realized by, for example, a device configured by circuits (for example, an ASIC, an FPGA, and/or a PLD). The communication I/F 86 is connected to the bus line 94. The communication I/F 86 controls the exchange of various pieces of information between the CPU 88 and an external device by the wireless communication method via the base station 20. Here, examples of the “external device” include the information processing apparatus 12, the unmanned aerial vehicle 27, and the HMD 34.

For example, as shown in FIG. 7, the HMD 34 is an example of a “display device” according to the technology of the present disclosure, and comprises the computer 150, a reception device 152, the display 156, a microphone 157, the speaker 158, an eye tracker 166, and a communication I/F 168.

The computer 150 comprises a CPU 160, a storage 162, and a memory 164, and the CPU 160, the storage 162, and the memory 164 are connected via a bus line 170. In the example shown in FIG. 7, for convenience of illustration, one bus line is shown as the bus line 170, but a data bus, an address bus, a control bus, and the like are included in the bus line 170.

The CPU 160 controls the whole HMD 34. Various parameters and various programs are stored in the storage 162. The storage 162 is a non-volatile storage device. Here, an EEPROM is adopted as an example of the storage 162, but the technology of the present disclosure is not limited to this, and a mask ROM, an HDD, an SSD, or the like may be used. The memory 164 is a volatile storage device. Various pieces of information are transitorily stored in the memory 164, and the memory 164 is used as a work memory by the CPU 160. Here, a DRAM is adopted as an example of the memory 164, but the technology of the present disclosure is not limited to this, and another type of volatile storage device, such as an SRAM, may be used.

The reception device 152 receives the instruction from the viewer 28. Examples of the reception device 152 include a remote controller and/or a hard key. The reception device 152 is connected to the bus line 170, and the CPU 160 acquires the instruction received by the reception device 152.

The display 156 is a display that can display the distribution video visually recognized by the viewer 28. The display 156 is connected to the bus line 170 and displays various pieces of information under the control of the CPU 160.

The microphone 157 performs the sound collection (collects sound) and converts the collected sound into the sound information which is the electric signal. The microphone 157 is connected to the bus line 170. The CPU 160 acquires the sound information obtained by converting the sound collected by the microphone 157 via the bus line 170.

The speaker 158 converts the electric signal into the sound. The speaker 158 is connected to the bus line 170. The speaker 158 receives the electric signal output from the CPU 160 via the bus line 170, converts the received electric signal into the sound, and outputs the sound obtained by converting the electric signal to the outside of the HMD 34.

The eye tracker 166 comprises an imaging element 166A. Here, a CMOS image sensor is adopted as the imaging element 166A. Note that the imaging element 166A is not limited to the CMOS image sensor, and another type of image sensor, such as a CCD image sensor, may be adopted. The eye tracker 166 uses the imaging element 166A to image both eyes of the viewer 28 depending on a predetermined frame rate (for example, 60 fps). The eye tracker 166 detects the visual line direction of the viewer 28 (hereinafter, also simply referred to as the “visual line direction”) based on an eye image (image showing the eyes of the viewer 28) obtained by imaging both eyes of the viewer 28.

That is, the eye tracker 166 detects the visual line direction based on the image obtained by imaging by the imaging element 166A as the observation direction (hereinafter, also simply referred to as the “observation direction”) of the viewer 28 who observes the target subject image (hereinafter, also simply referred to as the “target subject image”) showing the target subject in the distribution video in a state in which the distribution video (for example, the virtual viewpoint video 46) is displayed on the display 156. Note that the eye tracker 166 is an example of a “detection unit (detector)” according to the technology of the present disclosure.
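
For illustration only, a toy pupil-center approach to estimating the observation direction from an eye image is sketched below in Python; the eye tracker 166 is described above only as detecting the visual line direction from the eye image, so the thresholding and normalization used here are assumptions rather than the disclosed method.

    import numpy as np

    def estimate_gaze_offset(eye_image):
        # eye_image: 2-D array of grayscale pixel intensities from the imaging element.
        eye_image = np.asarray(eye_image, dtype=float)
        # Treat the darkest 5 % of pixels as the pupil region.
        pupil_mask = eye_image <= np.percentile(eye_image, 5)
        ys, xs = np.nonzero(pupil_mask)
        pupil_center = np.array([xs.mean(), ys.mean()])
        image_center = np.array([eye_image.shape[1] / 2.0, eye_image.shape[0] / 2.0])
        # Normalized offset of the pupil from the image center; the sign indicates
        # the horizontal and vertical components of the visual line direction.
        return (pupil_center - image_center) / image_center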

The communication I/F 168 is connected to the base station 20 in a wirelessly communicable manner. The communication I/F 168 is realized by, for example, a device configured by circuits (for example, an ASIC, an FPGA, and/or a PLD). The communication I/F 168 is connected to the bus line 170. The communication I/F 168 controls the exchange of various pieces of information between the CPU 160 and an external device by the wireless communication method via the base station 20. Here, examples of the “external device” include the information processing apparatus 12, the unmanned aerial vehicle 27, and the smartphone 14.

For example, as shown in FIG. 8, the sound collection device 100 comprises a computer 200, a microphone 207, and a communication I/F 218. The computer 200 comprises a CPU 210, a storage 212, and a memory 214, and the CPU 210, the storage 212, and the memory 214 are connected via a bus line 220. In the example shown in FIG. 8, for convenience of illustration, one bus line is shown as the bus line 220, but a data bus, an address bus, a control bus, and the like are included in the bus line 220.

The CPU 210 controls the whole sound collection device 100. Various parameters and various programs are stored in the storage 212. The storage 212 is a non-volatile storage device. Here, an EEPROM is adopted as an example of the storage 212, but the technology of the present disclosure is not limited to this, and a mask ROM, an HDD, an SSD, or the like may be used. The memory 214 is a volatile storage device. Various pieces of information are transitorily stored in the memory 214, and the memory 214 is used as a work memory by the CPU 210. Here, a DRAM is adopted as an example of the memory 214, but the technology of the present disclosure is not limited to this, and another type of volatile storage device, such as an SRAM, may be used.

The microphone 207 performs the sound collection (collects sound) andconverts the collected sound into the electric signal. The microphone207 is connected to the bus line 220. The CPU 210 acquires the electricsignal obtained by converting the sound collected by the microphone 207via the bus line 220.

The communication I/F 218 is connected to the base station 20 in thewirelessly communicable manner. The communication I/F 218 is realizedby, for example, a device configured by circuits (an ASIC, an FPGA,and/or a PLD). The communication I/F 218 is connected to the bus line220. The communication I/F 218 controls the exchange of various piecesof information between the CPU 210 and the information processingapparatus 12 by the wireless communication method via the base station20.

For example, as shown in FIG. 9, in the information processing apparatus12, the storage 60 stores a video generation program 60A and a soundgeneration program 60B. Note that, in the following, in a case in whicha distinction is not necessary, the video generation program 60A and thesound generation program 60B are referred to as a “informationprocessing apparatus program” without reference numeral.

The CPU 58 is an example of a “processor” according to the technology of the present disclosure, and the memory 62 is an example of a “memory” according to the technology of the present disclosure. The CPU 58 reads out the information processing apparatus program from the storage 60, and expands the readout information processing apparatus program in the memory 62. The CPU 58 controls the whole information processing apparatus 12 according to the information processing apparatus program expanded in the memory 62, and exchanges various pieces of information with the plurality of imaging apparatuses, the unmanned aerial vehicle 27, the terminal device, and the plurality of sound collection devices 100.

The CPU 58 reads out the video generation program 60A from the storage 60, and expands the readout video generation program 60A in the memory 62. The CPU 58 is operated as a video generation unit 58A and an acquisition unit 58B according to the video generation program 60A expanded in the memory 62. The CPU 58 is operated as the video generation unit 58A and the acquisition unit 58B to execute a video generation process (see FIG. 20), which will be described below.

The CPU 58 reads out the sound generation program 60B from the storage 60, and expands the readout sound generation program 60B in the memory 62. The CPU 58 is operated as the acquisition unit 58B, a specifying unit 58C, an adjustment sound information generation unit 58D, and an output unit 58E according to the sound generation program 60B expanded in the memory 62. The CPU 58 is operated as the acquisition unit 58B, the specifying unit 58C, the adjustment sound information generation unit 58D, and the output unit 58E to execute a sound generation process (see FIGS. 21 and 22) described below. Note that the adjustment sound information generation unit 58D is an example of a “generation unit” according to the technology of the present disclosure.

For example, as shown in FIG. 10, the information processing apparatus 12 transmits the bird's-eye view video 46A to the smartphone 14. The smartphone 14 receives the bird's-eye view video 46A transmitted from the information processing apparatus 12. The bird's-eye view video 46A received by the smartphone 14 is displayed on the display 78 of the smartphone 14.

In a state in which the bird's-eye view video 46A is displayed on the display 78, the viewer 28 selectively gives a viewpoint indication, a visual line indication, and an angle-of-view indication to the smartphone 14. The viewpoint indication refers to an indication of a position of a virtual viewpoint with respect to the imaging region (hereinafter, referred to as the “virtual viewpoint”). The visual line indication refers to an indication of a direction of a virtual visual line with respect to the imaging region (hereinafter, referred to as the “virtual visual line”). The angle-of-view indication refers to an indication of an angle of view with respect to the imaging region (hereinafter, simply referred to as the “angle of view”). Hereinafter, for convenience of description, in a case in which a distinction is not necessary, the viewpoint indication, the visual line indication, and the angle-of-view indication are referred to as a “viewpoint/visual line/angle-of-view indication”. The position of the virtual viewpoint is also referred to as a “virtual viewpoint position”. In addition, the direction of the virtual visual line is also referred to as a “virtual visual line direction”.

Examples of the viewpoint indication include a touch operation on the touch panel 76A. Instead of the touch operation, a tap operation or a double tap operation may be used. Examples of the visual line indication include a slide operation on the touch panel 76A. Instead of the slide operation, a flick operation may be used. Examples of the angle-of-view indication include a pinch operation on the touch panel 76A. The pinch operation is roughly classified into a pinch-in operation and a pinch-out operation. The pinch-in operation is an operation performed in a case in which the angle of view is widened, and the pinch-out operation is an operation performed in a case in which the angle of view is narrowed.

Viewpoint information indicating the virtual viewpoint position as indicated by the viewpoint indication, visual line direction information indicating the virtual visual line direction as indicated by the visual line indication, and angle-of-view information indicating the angle of view as indicated by the angle-of-view indication are transmitted to the information processing apparatus 12 by the CPU 88 of the smartphone 14. Note that in the following, for convenience of description, in a case in which a distinction is not necessary, the viewpoint information, the visual line direction information, and the angle-of-view information are referred to as “viewpoint/visual line/angle-of-view information”.
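
The concrete data format of the viewpoint/visual line/angle-of-view information is not prescribed here. The following Python sketch shows one possible representation of the information transmitted from the smartphone 14 to the information processing apparatus 12; the field names, the coordinate conventions, and the JSON serialization are assumptions for illustration only.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ViewpointVisualLineAngleOfViewInfo:
    viewpoint_xyz: tuple       # virtual viewpoint position in imaging-region coordinates (metres)
    visual_line_dir: tuple     # virtual visual line direction as a unit vector
    angle_of_view_deg: float   # angle of view in degrees

def to_payload(info: ViewpointVisualLineAngleOfViewInfo) -> bytes:
    """Serialise the indication for transmission to the information processing apparatus."""
    return json.dumps(asdict(info)).encode("utf-8")

# Example: a viewpoint 10 m along the touch line, looking across the field at a 45-degree angle of view.
payload = to_payload(ViewpointVisualLineAngleOfViewInfo((10.0, 5.0, 2.0), (0.0, 1.0, 0.0), 45.0))
```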

The viewpoint/visual line/angle-of-view information transmitted by the CPU 88 of the smartphone 14 is received by the video generation unit 58A, and the angle-of-view information transmitted by the CPU 88 of the smartphone 14 is received by the adjustment sound information generation unit 58D.

For example, as shown in FIG. 11, the video generation unit 58A acquires the bird's-eye view video 46A from the unmanned aerial vehicle 27, and acquires the imaging video 46B from each of the plurality of imaging apparatuses 16. The bird's-eye view video 46A is provided with first position association information, and the imaging video 46B is provided with second position association information.

The first position association information is information indicating a correspondence between a position in the imaging region and a position in the bird's-eye view video 46A (for example, a position of a pixel). In the first position association information, position-in-imaging region specification information (for example, a three-dimensional coordinate) for specifying the position in the imaging region and position-in-bird's-eye view video specification information for specifying the position in the bird's-eye view video 46A are associated with each other. Note that, for example, as shown in FIG. 11, the imaging region is a rectangular parallelepiped three-dimensional region with the soccer field 24 as the bottom plane, and the position-in-imaging region specification information is expressed by the three-dimensional coordinate with one of four corners of the soccer field 24 as an origin 24A.

The second position association information is information indicating a correspondence between the position in the imaging region and the position in the imaging video 46B (for example, the position of the pixel). In the second position association information, the position-in-imaging region specification information (for example, the three-dimensional coordinate) for specifying the position in the imaging region and position-in-imaging video specification information for specifying the position in the imaging video 46B are associated with each other.

The video generation unit 58A generates the virtual viewpoint video 46 by using the bird's-eye view video 46A acquired from the unmanned aerial vehicle 27 and the imaging video 46B acquired from each of the plurality of imaging apparatuses 16 based on the viewpoint/visual line/angle-of-view information. The virtual viewpoint video 46 is provided with third position association information. The third position association information is information indicating a correspondence between the position in the imaging region and the position in the virtual viewpoint video 46 (for example, the position of the pixel), and is an example of “correspondence information” according to the technology of the present disclosure. The third position association information is generated by the video generation unit 58A based on the first position association information and the second position association information.
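
The manner in which the third position association information is computed is not restricted to any specific projection model. The following Python sketch shows one conceivable way of associating a position in the imaging region with a pixel position in the virtual viewpoint video 46 by placing a simple pinhole camera at the virtual viewpoint; the pinhole model, the world up direction, and the function signature are assumptions for illustration only and are not the rendering method of the video generation unit 58A itself.

```python
import numpy as np

def project_to_virtual_view(region_xyz, viewpoint_xyz, visual_line_dir,
                            angle_of_view_deg, image_w, image_h):
    """Project an imaging-region position (origin 24A coordinates) into pixel coordinates
    of the virtual viewpoint video, assuming a pinhole camera whose optical axis is the
    virtual visual line."""
    forward = np.asarray(visual_line_dir, dtype=float)
    forward /= np.linalg.norm(forward)
    world_up = np.array([0.0, 0.0, 1.0])            # assumed up direction of the imaging region
    right = np.cross(forward, world_up)
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    rel = np.asarray(region_xyz, dtype=float) - np.asarray(viewpoint_xyz, dtype=float)
    x, y, depth = rel @ right, rel @ up, rel @ forward
    if depth <= 0:
        return None                                  # the position is behind the virtual viewpoint
    focal = (image_w / 2.0) / np.tan(np.radians(angle_of_view_deg) / 2.0)
    px = image_w / 2.0 + focal * x / depth
    py = image_h / 2.0 - focal * y / depth
    return px, py
```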

Note that, here, since the virtual viewpoint video 46 is generated, the third position association information is an example of the “correspondence information” according to the technology of the present disclosure. However, in a case in which the virtual viewpoint video 46 is not generated by the video generation unit 58A and the bird's-eye view video 46A is used as it is instead of the virtual viewpoint video 46, the first position association information is an example of the “correspondence information” according to the technology of the present disclosure. In addition, in a case in which the virtual viewpoint video 46 is not generated by the video generation unit 58A and the imaging video 46B is used as it is instead of the virtual viewpoint video 46, the second position association information is an example of the “correspondence information” according to the technology of the present disclosure.

For example, as shown in FIG. 12, in a case in which the viewpoint information and the visual line direction information are changed, the video generation unit 58A regenerates the virtual viewpoint video 46 with the changes in the viewpoint information and the visual line direction information. In a case in which the virtual viewpoint video 46 is regenerated by the video generation unit 58A according to the viewpoint information and the visual line direction information, the third position association information is also regenerated by the video generation unit 58A based on the first position association information and the second position association information. Moreover, the regenerated third position association information is provided to the latest virtual viewpoint video 46 by the video generation unit 58A.

For example, as shown in FIG. 13, in a case in which the angle-of-view information is changed, the video generation unit 58A regenerates the virtual viewpoint video 46 with the change in the angle-of-view information. In a case in which the virtual viewpoint video 46 is regenerated by the video generation unit 58A according to the angle-of-view information, the third position association information is also regenerated by the video generation unit 58A based on the first position association information and the second position association information. Moreover, the regenerated third position association information is provided to the latest virtual viewpoint video 46 by the video generation unit 58A.

For example, as shown in FIG. 14, the video generation unit 58A transmits the virtual viewpoint video 46 and the third position association information to the HMD 34. In the HMD 34, the CPU 160 receives the virtual viewpoint video 46 and the third position association information transmitted from the video generation unit 58A, and displays the received virtual viewpoint video 46 on the display 156.

Here, the imaging element 166A images eyes 29 of the viewer 28 in a state in which the virtual viewpoint video 46 is displayed on the display 156. The eye tracker 166 detects the observation direction based on the eye image obtained by imaging the eyes 29 by the imaging element 166A, and outputs observation direction specification information for specifying the detected observation direction to the CPU 160.

The CPU 160 specifies a position at which the viewer 28 directs attention (hereinafter, referred to as an “attention position”) on the display 156 (specifically, the screen 156A shown in FIG. 2) based on the observation direction specification information and position-in-virtual viewpoint video specification information included in the third position association information. Moreover, the CPU 160 derives target subject position information based on the specified attention position and the third position association information.

The target subject position information includes subject position-in-imaging region information and subject position-in-virtual viewpoint video information. The subject position-in-imaging region information is information indicating the position of the target subject in the imaging region (hereinafter, also referred to as a “target subject position”). Here, as an example of the subject position-in-imaging region information, the three-dimensional coordinate for specifying the target subject position in the imaging region is adopted. The subject position-in-virtual viewpoint video information is information (for example, an address for specifying the position of the pixel) for specifying the position of a target subject image 47 in the virtual viewpoint video 46 (hereinafter, also referred to as a “target subject image position”). The target subject position information is information in which the subject position-in-imaging region information and the subject position-in-virtual viewpoint video information are associated with each other in a state in which the correspondence between the target subject position and the target subject image position can be specified.

The CPU 160 derives the target subject position information based on the third position association information and a detection result by the eye tracker 166, that is, the observation direction specification information. Specifically, the CPU 160 acquires, from the third position association information, the position-in-imaging region specification information and the position-in-virtual viewpoint video specification information corresponding to the attention position as the target subject position information. The CPU 160 transmits the acquired target subject position information to the information processing apparatus 12.
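
As an illustration of the derivation described above, the following Python sketch looks up, from a third position association table, the entry whose pixel position is nearest to the attention position and returns the associated pair of pieces of information. The table layout (an iterable of pairs), the nearest-pixel rule, and the dictionary keys are assumptions for illustration and are not the processing of the CPU 160 itself.

```python
def derive_target_subject_position(attention_xy, third_association):
    """Return the association entry whose pixel position is nearest to the attention position.

    `third_association` is assumed to be an iterable of (region_xyz, pixel_xy) pairs,
    i.e. (position-in-imaging region, position-in-virtual viewpoint video)."""
    def pixel_distance(entry):
        _, (px, py) = entry
        return (px - attention_xy[0]) ** 2 + (py - attention_xy[1]) ** 2

    region_xyz, pixel_xy = min(third_association, key=pixel_distance)
    return {"subject_position_in_imaging_region": region_xyz,
            "subject_position_in_virtual_viewpoint_video": pixel_xy}
```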

In the information processing apparatus 12, the acquisition unit 58Bcomprises a target subject position information acquisition unit 58B1.The target subject position information acquisition unit 58B1 acquiresthe target subject position information. Here, the target subjectposition information transmitted from the CPU 160 of the HMD 34 isacquired by being received by the target subject position informationacquisition unit 58B1. For example, as shown in FIG. 15, the targetsubject position information acquisition unit 58B1 outputs the targetsubject position information to the video generation unit 58A. Moreover,the video generation unit 58A generates the virtual viewpoint video 46by using the bird's-eye view video 46A and the imaging video 46B basedon the viewpoint/visual line/angle-of-view information and the targetsubject position information described above. Specifically, the videogeneration unit 58A generates the virtual viewpoint video 46 that is infocus with respect to the target subject image position specified by theposition-in-virtual viewpoint video specification information includedin the target subject position information input by the target subjectposition information acquisition unit 58B1. That is, the videogeneration unit 58A generates the virtual viewpoint video 46 that is infocus with respect to the target subject image 47 more than the image ina periphery of the target subject image 47. Here, a state in which thetarget subject image 47 is in focus more than the images in a peripheryof the target subject image 47 means that a contrast value of the targetsubject image 47 is higher than contrast values of the images in aperiphery of the target subject image 47.

For example, as shown in FIG. 15, the virtual viewpoint video 46 isroughly classified into a focused region in which the target subjectimage 47 is positioned and a peripheral region of the target subjectimage 47, that is, a non-focused region having lower contrast value thanthe focused region. Here, the target subject image 47 is an example of a“virtual viewpoint target subject image” according to the technology ofthe present disclosure. The virtual viewpoint video 46 having thefocused region and the non-focused region is transmitted to the HMD 34by the video generation unit 58A in a state in which the third positionassociation information is provided. Moreover, in the HMD 34, the CPU160 receives the virtual viewpoint video 46 and the third positionassociation information transmitted from the video generation unit 58A.Moreover, the CPU 160 displays the received virtual viewpoint video 46on the display 156.

For example, as shown in FIG. 16, the acquisition unit 58B comprises a sound collection device side information acquisition unit 58B2 in addition to the target subject position information acquisition unit 58B1. The target subject position information acquisition unit 58B1 outputs the target subject position information acquired from the HMD 34 to the specifying unit 58C.

The sound collection device 100 transmits the sound information and sound collection position specification information indicating the position of the sound collection device 100 in the imaging region (hereinafter, also referred to as a “sound collection device position”) to the information processing apparatus 12. Here, as an example of the sound collection position specification information, the three-dimensional coordinate for specifying the sound collection device position in the imaging region is adopted. Note that the sound collection position specification information is an example of “sound collection device position information” according to the technology of the present disclosure.

In the information processing apparatus 12, the sound collection device side information acquisition unit 58B2 acquires the sound information and the sound collection position specification information. Here, the sound information and the sound collection position specification information transmitted from the sound collection device 100 are acquired by being received by the sound collection device side information acquisition unit 58B2. The sound collection device side information acquisition unit 58B2 generates sound collection device information based on the sound information and the sound collection position specification information acquired from the sound collection device 100. The sound collection device information is information in which the sound information and the sound collection position specification information are associated with each other for each sound collection device 100. The sound collection device side information acquisition unit 58B2 outputs the generated sound collection device information to the specifying unit 58C.

For example, as shown in FIG. 17, the specifying unit 58C acquires the target subject position information from the target subject position information acquisition unit 58B1 and acquires the sound collection device information from the sound collection device side information acquisition unit 58B2. Moreover, the specifying unit 58C specifies the target sound in the region corresponding to the target subject from the plurality of pieces of sound information based on the target subject position information and the sound collection device information.

The specifying unit 58C acquires the sound collection device information for each of the plurality of sound collection devices 100 from the sound collection device side information acquisition unit 58B2. That is, the specifying unit 58C acquires a plurality of pieces of the sound collection device information from the sound collection device side information acquisition unit 58B2. The specifying unit 58C specifies, from the plurality of pieces of sound collection device information, the sound collection device information having the sound collection position specification information corresponding to the subject position-in-imaging region information included in the target subject position information. Here, the sound collection position specification information corresponding to the subject position-in-imaging region information refers to the sound collection position specification information for specifying the sound collection device position closest to the target subject position specified by the subject position-in-imaging region information, among the plurality of sound collection device positions indicated by the plurality of pieces of sound collection position specification information included in the plurality of pieces of sound collection device information.
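
The following Python sketch illustrates the closest-device rule described above; the list-of-dictionaries layout of the sound collection device information and the key names are assumptions for illustration only.

```python
import math

def specify_target_sound(target_region_xyz, sound_collection_device_info):
    """Select the sound information of the sound collection device closest to the target subject position.

    `sound_collection_device_info` is assumed to be a list of dicts, each holding a
    "position" (three-dimensional coordinate) and the corresponding "sound" information."""
    def distance(entry):
        return math.dist(entry["position"], target_region_xyz)

    closest = min(sound_collection_device_info, key=distance)
    return closest["sound"]
```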

The specifying unit 58C specifies the sound information included in the specified sound collection device information as the target sound information indicating the target sound in the region corresponding to the target subject position.

The adjustment sound information generation unit 58D acquires the target sound information specified by the specifying unit 58C from the specifying unit 58C, and acquires the sound collection device information for each of the plurality of sound collection devices 100 from the sound collection device side information acquisition unit 58B2. The adjustment sound information generation unit 58D generates the adjustment sound information based on the acquired target sound information and the acquired sound collection device information. The adjustment sound information is roughly classified into integration sound information and target subject emphasis sound information. The integration sound information is an example of “integration sound information” and “comprehensive sound information” according to the technology of the present disclosure. The integration sound information refers to information indicating an integration sound. The integration sound is an example of an “integration sound” and a “comprehensive sound” according to the technology of the present disclosure. The integration sound refers to a sound obtained by integrating a plurality of sounds obtained by the plurality of sound collection devices 100. The target subject emphasis sound information refers to information indicating a sound (hereinafter, also referred to as a “target subject emphasis sound”) including the target sound (hereinafter, also referred to as an “emphasis target sound”) that is emphasized more than a peripheral sound. The peripheral sound refers to a sound emitted from a region different from the region corresponding to the target subject position indicated by the subject position-in-imaging region information included in the target subject position information acquired by the specifying unit 58C.

Here, the region corresponding to the target subject position refers to, for example, the target subject itself. Note that, not limited to the above, in a case in which a center position of the target subject is set as the target subject position, the region corresponding to the target subject position may be a three-dimensional region defined by a predetermined distance from the target subject position. Examples of the three-dimensional region defined by the predetermined distance from the target subject position include a spherical region within a radius of 3 meters centered on the target subject position and a cubic region of 4 meters on a side centered on the target subject position.

Here, as an example of the peripheral sound, the sound indicated by the sound information included in the sound collection device information different from the sound collection device information in which the target sound information is included as the sound information is adopted. The emphasis target sound is realized by making the volume of the peripheral sound lower than the volume indicated by the sound information on the peripheral sound, or by making the volume of the target sound higher than the volume indicated by the target sound information. Note that, not limited to the above, the emphasis target sound may be realized by making the volume of the peripheral sound lower than the volume indicated by the sound information on the peripheral sound and making the volume of the target sound higher than the volume indicated by the target sound information.
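
One conceivable realization of the volume adjustment described above is sketched below in Python; the sample-array representation of the sounds, the concrete gain values, and the simple sample-wise mix are assumptions for illustration and are not the generation method of the adjustment sound information generation unit 58D itself.

```python
def generate_target_subject_emphasis_sound(target_samples, peripheral_samples_list,
                                           target_gain=2.0, peripheral_gain=0.5):
    """Mix the target sound and the peripheral sounds with different gains.

    Raising the target gain and lowering the peripheral gain correspond to the two
    adjustments described above; using both at once corresponds to the combined case."""
    mixed = [target_gain * s for s in target_samples]
    for peripheral in peripheral_samples_list:
        for i in range(min(len(mixed), len(peripheral))):
            mixed[i] += peripheral_gain * peripheral[i]
    return mixed
```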

The adjustment sound information generation unit 58D selectively executes a first generation process and a second generation process. The first generation process is a process of generating the target subject emphasis sound information, and the second generation process is a process of generating the integration sound information. The adjustment sound information generation unit 58D selectively executes the first generation process and the second generation process based on the angle-of-view information acquired from the smartphone 14.

For example, as shown in FIG. 18, the adjustment sound information generation unit 58D executes the first generation process in a case in which the angle of view indicated by the angle-of-view information is less than a reference angle of view, and executes the second generation process in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view. In the example shown in FIG. 18, in a case in which the angle of view indicated by the angle-of-view information is defined as “θ” and the reference angle of view is defined as “θ_th”, the first generation process is executed by the adjustment sound information generation unit 58D to generate the target subject emphasis sound information in a case of “angle of view θ < reference angle of view θ_th”. In addition, the second generation process is executed by the adjustment sound information generation unit 58D to generate the integration sound information in a case of “angle of view θ ≥ reference angle of view θ_th”.
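
The selection between the two generation processes can be pictured as a simple threshold comparison, as in the following Python sketch; the numerical value of the reference angle of view and the use of callables to stand in for the first and second generation processes are assumptions for illustration only.

```python
REFERENCE_ANGLE_OF_VIEW_DEG = 60.0  # reference angle of view θ_th; the value is illustrative only

def generate_adjustment_sound_information(angle_of_view_deg, first_generation, second_generation):
    """Execute the first generation process when θ < θ_th, otherwise the second generation process.

    `first_generation` and `second_generation` are callables standing in for the processes
    of the adjustment sound information generation unit 58D."""
    if angle_of_view_deg < REFERENCE_ANGLE_OF_VIEW_DEG:
        return first_generation()   # target subject emphasis sound information
    return second_generation()      # integration sound information
```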

In a case in which a content of the virtual viewpoint video 46 displayed by the HMD 34 does not match the target subject emphasis sound, the target subject emphasis sound may cause discomfort to the viewer 28. Therefore, here, a fixed value derived in advance by a sensory test and/or a computer simulation is adopted as the reference angle of view θ_th, that is, a lower limit value of the angle of view at which outputting the integration sound from the speaker 158 causes less discomfort to the viewer 28 than outputting the target subject emphasis sound from the speaker 158.

Note that, here, the fixed value is adopted as the reference angle of view θ_th, but the reference angle of view is not limited to this, and a variable value that can be changed in response to the instruction received by the reception device 52, 76, or 152 may be adopted as the reference angle of view θ_th.

The CPU 58 (see FIG. 9) is operated as the output unit 58E capable of outputting the target subject emphasis sound information generated by the adjustment sound information generation unit 58D. The output unit 58E acquires the target subject emphasis sound information from the adjustment sound information generation unit 58D and outputs the acquired target subject emphasis sound information in a case in which the target subject emphasis sound information is generated by executing the first generation process. That is, the output unit 58E transmits the target subject emphasis sound information to the HMD 34. In addition, the output unit 58E acquires the integration sound information from the adjustment sound information generation unit 58D and outputs the acquired integration sound information in a case in which the integration sound information is generated by executing the second generation process. That is, the output unit 58E transmits the integration sound information to the HMD 34.

The output of the target subject emphasis sound information and the integration sound information by the output unit 58E is performed in synchronization with the output of the virtual viewpoint video 46 to the HMD 34 by the video generation unit 58A. In this case, the video generation unit 58A outputs a synchronization signal to the output unit 58E at the timing when the output of the virtual viewpoint video 46 is started. The output of the target subject emphasis sound information and the integration sound information by the output unit 58E is performed in response to the input of the synchronization signal from the video generation unit 58A.

In the HMD 34, the target subject emphasis sound information transmitted from the output unit 58E is received by the CPU 160, and the target subject emphasis sound indicated by the received target subject emphasis sound information is output from the speaker 158. In addition, in the HMD 34, the integration sound information transmitted from the output unit 58E is received by the CPU 160, and the integration sound indicated by the received integration sound information is output from the speaker 158.

Next, an operation of the information processing system 10 will be described.

First, an example of a flow of the video generation process executed by the CPU 58 of the information processing apparatus 12 according to the video generation program 60A will be described with reference to FIG. 20.

In the video generation process shown in FIG. 20, first, the video generation unit 58A acquires the bird's-eye view video 46A, the imaging video 46B, and the viewpoint/visual line/angle-of-view information in step ST10, and then the video generation process proceeds to step ST12.

In step ST12, the video generation unit 58A generates the virtual viewpoint video 46 that is in focus at infinity by using the bird's-eye view video 46A and the imaging video 46B, which are acquired in step ST10, based on the viewpoint/visual line/angle-of-view information acquired in step ST10, and then the video generation process proceeds to step ST14.

In step ST14, the video generation unit 58A outputs the virtual viewpoint video 46 generated in step ST12 to the HMD 34, and then the video generation process proceeds to step ST16. The virtual viewpoint video 46 output to the HMD 34 by the execution of the process of step ST14 is displayed on the display 156 in the HMD 34 and is visually recognized by the viewer 28.

In step ST16, the target subject position information acquisition unit 58B1 acquires the target subject position information derived by the CPU 160 based on the detection result by the eye tracker 166, and then the video generation process proceeds to step ST18.

In step ST18, the video generation unit 58A acquires the bird's-eye view video 46A, the imaging video 46B, and the viewpoint/visual line/angle-of-view information, and then the video generation process proceeds to step ST20.

In step ST20, the video generation unit 58A generates the virtual viewpoint video 46 that is in focus with respect to the target subject image 47 by using the bird's-eye view video 46A and the imaging video 46B, which are acquired in step ST18, based on the viewpoint/visual line/angle-of-view information acquired in step ST18 and the target subject position information acquired in step ST16, and then the video generation process proceeds to step ST22.

In step ST22, the video generation unit 58A outputs the virtual viewpoint video 46 generated in step ST20 to the HMD 34, and then the video generation process proceeds to step ST24. The virtual viewpoint video 46 output to the HMD 34 by the execution of the process of step ST22 is displayed on the display 156 in the HMD 34 and is visually recognized by the viewer 28.

In step ST24, the CPU 58 determines whether or not a condition for terminating the video generation process (video generation process termination condition) is satisfied. Examples of the video generation process termination condition include a condition that an instruction for terminating the video generation process is received by the reception device 52, 76, or 152. In a case in which the video generation process termination condition is not satisfied in step ST24, a negative determination is made, and the video generation process proceeds to step ST16. In a case in which the video generation process termination condition is satisfied in step ST24, a positive determination is made, and the video generation process is terminated.

Next, an example of a flow of the sound generation process executed by the CPU 58 of the information processing apparatus 12 according to the sound generation program 60B will be described with reference to FIGS. 21 and 22. Note that, here, the description will be made on the premise that the synchronization signal is output from the video generation unit 58A to the output unit 58E at the timing when the output of the virtual viewpoint video 46 by the video generation unit 58A is started.

In the sound generation process shown in FIG. 21, first, the sound collection device side information acquisition unit 58B2 acquires the sound information and the sound collection position specification information from each of the plurality of sound collection devices 100 in step ST50, and then the sound generation process proceeds to step ST52.

In step ST52, the sound collection device side information acquisition unit 58B2 generates the sound collection device information for each of the plurality of sound collection devices 100 based on the sound information and the sound collection position specification information, which are acquired in step ST50, and then the sound generation process proceeds to step ST54.

In step ST54, the adjustment sound information generation unit 58D acquires the angle-of-view information from the smartphone 14, and then the sound generation process proceeds to step ST56.

In step ST56, the adjustment sound information generation unit 58D determines whether or not the angle of view indicated by the angle-of-view information acquired in step ST54 is less than the reference angle of view. In step ST56, in a case in which the angle of view indicated by the angle-of-view information acquired in step ST54 is equal to or more than the reference angle of view, a negative determination is made, and the sound generation process proceeds to step ST58 shown in FIG. 22. In step ST56, in a case in which the angle of view indicated by the angle-of-view information acquired in step ST54 is less than the reference angle of view, a positive determination is made, and the sound generation process proceeds to step ST64.

In step ST58 shown in FIG. 22, the adjustment sound information generation unit 58D generates the integration sound information based on the sound collection device information generated in step ST52, and then the sound generation process proceeds to step ST60.

In step ST60, the output unit 58E determines whether or not the synchronization signal is input from the video generation unit 58A. In step ST60, in a case in which the synchronization signal is not input from the video generation unit 58A, a negative determination is made, and the determination in step ST60 is made again. In a case in which the synchronization signal is input from the video generation unit 58A in step ST60, a positive determination is made, and the sound generation process proceeds to step ST62.

In step ST62, the output unit 58E outputs the integration sound information generated in step ST58 to the HMD 34, and then the sound generation process proceeds to step ST74 shown in FIG. 21. The integration sound indicated by the integration sound information output to the HMD 34 by the execution of the process of step ST62 is output from the speaker 158 in the HMD 34 and heard by the viewer 28.

In step ST64 shown in FIG. 21, the target subject position information acquisition unit 58B1 acquires the target subject position information from the HMD 34, and then the sound generation process proceeds to step ST66.

In step ST66, the specifying unit 58C specifies the target sound information based on the sound collection device information generated in step ST52 and the target subject position information acquired in step ST64, and then the sound generation process proceeds to step ST68.

In step ST68, the adjustment sound information generation unit 58D generates the target subject emphasis sound information based on the sound collection device information generated in step ST52 and the target sound information specified in step ST66, and then the sound generation process proceeds to step ST70.

In step ST70, the output unit 58E determines whether or not the synchronization signal is input from the video generation unit 58A. In step ST70, in a case in which the synchronization signal is not input from the video generation unit 58A, a negative determination is made, and the determination in step ST70 is made again. In a case in which the synchronization signal is input from the video generation unit 58A in step ST70, a positive determination is made, and the sound generation process proceeds to step ST72.

In step ST72, the output unit 58E outputs the target subject emphasis sound information generated in step ST68 to the HMD 34, and then the sound generation process proceeds to step ST74. The target subject emphasis sound indicated by the target subject emphasis sound information output to the HMD 34 by the execution of the process of step ST72 is output from the speaker 158 in the HMD 34 and heard by the viewer 28.

In step ST74, the CPU 58 determines whether or not a condition for terminating the sound generation process (sound generation process termination condition) is satisfied. Examples of the sound generation process termination condition include a condition that an instruction for terminating the sound generation process is received by the reception device 52, 76, or 152. In a case in which the sound generation process termination condition is not satisfied in step ST74, a negative determination is made, and the sound generation process proceeds to step ST50. In a case in which the sound generation process termination condition is satisfied in step ST74, a positive determination is made, and the sound generation process is terminated.

As described above, in the information processing apparatus 12, the target subject position information acquisition unit 58B1 acquires the target subject position information from the HMD 34, and the sound collection device side information acquisition unit 58B2 acquires the sound information and the sound collection position specification information from each of the plurality of sound collection devices 100. In addition, the specifying unit 58C specifies the target sound in the region corresponding to the target subject position from the plurality of pieces of sound information based on the sound collection position specification information and the target subject position information. Moreover, in a case in which the virtual viewpoint video 46 is generated, the target subject emphasis sound information is generated by the adjustment sound information generation unit 58D. The target subject emphasis sound information is the information indicating the target subject emphasis sound. The target subject emphasis sound is a sound including the emphasis target sound in which the target sound is emphasized more than the sound (peripheral sound) emitted from the region different from the region corresponding to the target subject position indicated by the target subject position information acquired by the target subject position information acquisition unit 58B1. Therefore, it is possible to contribute to the listening by the viewer 28 of the sound emitted from the region corresponding to the target subject position indicated by the generated virtual viewpoint video 46.

In addition, in the information processing apparatus 12, the adjustment sound information generation unit 58D selectively executes the first generation process and the second generation process. The target subject emphasis sound information is generated in the first generation process, and the integration sound information is generated in the second generation process. Therefore, it is possible to selectively generate the target subject emphasis sound information and the integration sound information.

In addition, in the information processing apparatus 12, the first generation process is executed in a case in which the angle of view indicated by the angle-of-view information is less than the reference angle of view, and the second generation process is executed in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view. Therefore, it is possible to selectively generate the target subject emphasis sound information and the integration sound information depending on the angle of view.

In addition, in the information processing apparatus 12, the eye tracker 166 detects the observation direction of the viewer 28 who observes the virtual viewpoint video 46 in a state in which the virtual viewpoint video 46 is displayed on the display 156 of the HMD 34. Here, the CPU 160 generates the target subject position information based on the third position association information and the detection result by the eye tracker 166, and the generated target subject position information is acquired by the target subject position information acquisition unit 58B1. The target subject position information acquired by the target subject position information acquisition unit 58B1 is used for specifying the target sound information by the specifying unit 58C, and the target sound information specified by the specifying unit 58C is used for generating the target subject emphasis sound information by the adjustment sound information generation unit 58D. Therefore, it is possible to suppress erroneous generation, as the target subject emphasis sound information, of information indicating a sound emitted from a position in a direction irrelevant to the observation direction of the viewer 28.

In addition, in the information processing apparatus 12, the visual line direction of the viewer 28 is detected as the observation direction by the eye tracker 166 based on the eye image obtained by imaging the eyes 29 of the viewer 28 by the imaging element 166A. Therefore, it is possible to detect the observation direction with higher accuracy as compared to a case in which a direction different from the visual line direction of the viewer 28 is detected as the observation direction.

In addition, in the information processing apparatus 12, the HMD 34 is mounted on the viewer 28, and the HMD 34 is provided with the eye tracker 166. Therefore, as compared to a case in which the eye tracker 166 is not provided on the HMD 34, it is possible to detect the observation direction with higher accuracy in a state in which the HMD 34 is mounted on the viewer 28.

In addition, in the information processing apparatus 12, the target subject image is the image in the virtual viewpoint video 46 that is more in focus than the image in a periphery of the target subject image. Therefore, it is possible to specify, from the virtual viewpoint video 46, the position at which the target subject emphasis sound is emitted.

Further, in the information processing apparatus 12, the plurality of sound collection devices 100 are fixed in the imaging region. Therefore, it is possible to easily acquire the sound collection position specification information as compared to a case in which the plurality of sound collection devices 100 are moved.

Note that in the embodiment described above, the aspect example has been described in which the target subject position information acquisition unit 58B1 acquires the target subject position information based on the detection result by the eye tracker 166, but the technology of the present disclosure is not limited to this. For example, the target subject position information acquisition unit 58B1 may acquire the target subject position information based on the instruction received by the reception device 52, 76, or 152. In this case, first, in a state in which the distribution video (here, for example, the virtual viewpoint video 46) is displayed by the HMD 34, the indication information for indicating the target subject image position in the distribution video is received by the reception device 52, 76, or 152. Moreover, the target subject position information acquisition unit 58B1 acquires the target subject position information based on the third position association information and the indication information received by the reception device 52, 76, or 152. That is, the target subject position information acquisition unit 58B1 acquires the target subject position information by deriving, from the third position association information, the position-in-imaging region specification information corresponding to the target subject image position as indicated by the indication information as the target subject position information.

With the present configuration, it is possible to suppress the erroneous generation, as the target subject emphasis sound information, of the sound information indicating a sound emitted from a position that is not intended by the viewer 28 as the target subject position, as compared to a case in which the indication of the target subject position is given by using an image irrelevant to the imaging region. Note that, here, the reception device 52, 76, or 152 is an example of a “reception device (acceptor)” according to the technology of the present disclosure.

In addition, in the embodiment described above, the aspect example has been described in which the target subject emphasis sound information is not generated in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view, but the technology of the present disclosure is not limited to this. For example, the target subject emphasis sound information need not be generated in a case in which a frequency at which the observation direction of the viewer 28 changes per unit time (hereinafter, referred to as an “observation direction change frequency”) is equal to or more than a predetermined frequency.

In this case, the first generation process and the second generation process need only be selectively executed by the adjustment sound information generation unit 58D depending on the observation direction change frequency. In a case in which the first generation process and the second generation process are selectively executed depending on the observation direction change frequency, for example, as shown in FIG. 23, first, in the HMD 34, the CPU 160 calculates the observation direction change frequency (for example, N times per second) based on the observation direction specification information. The CPU 160 outputs observation direction change frequency information indicating the calculated frequency to the adjustment sound information generation unit 58D. The adjustment sound information generation unit 58D executes the first generation process or the second generation process by referring to the observation direction change frequency information. That is, in a case in which the observation direction change frequency is equal to or more than the predetermined frequency, the second generation process is executed without executing the first generation process. In addition, in a case in which the observation direction change frequency is less than the predetermined frequency, the first generation process is executed without executing the second generation process.
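
The following Python sketch illustrates one conceivable way of calculating the observation direction change frequency and of selecting the generation process accordingly; the frame-by-frame direction list, the change threshold, and the numerical value of the predetermined frequency are assumptions for illustration only.

```python
def observation_direction_change_frequency(directions_per_frame, frame_rate_fps=60,
                                           change_threshold_deg=5.0):
    """Count how many times per second the observation direction changes.

    `directions_per_frame` is assumed to be a list of (yaw, pitch) pairs in degrees, one per
    frame; a change is counted when either component moves by more than the threshold
    between consecutive frames."""
    changes = 0
    for (y0, p0), (y1, p1) in zip(directions_per_frame, directions_per_frame[1:]):
        if abs(y1 - y0) > change_threshold_deg or abs(p1 - p0) > change_threshold_deg:
            changes += 1
    seconds = len(directions_per_frame) / frame_rate_fps
    return changes / seconds if seconds > 0 else 0.0

PREDETERMINED_FREQUENCY = 3.0  # times per second; illustrative value only

def select_generation_process(change_frequency, first_generation, second_generation):
    # First generation process only while the observation direction is comparatively stable.
    if change_frequency < PREDETERMINED_FREQUENCY:
        return first_generation()   # target subject emphasis sound information
    return second_generation()      # integration sound information
```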

In a case in which the target subject emphasis sound is output from the speaker 158 in a state in which the observation direction is not determined, the target subject emphasis sound may cause discomfort to the viewer 28. Therefore, here, a fixed value derived in advance by a sensory test and/or a computer simulation is adopted as the predetermined frequency, that is, a lower limit value of the observation direction change frequency at which outputting the integration sound from the speaker 158 causes less discomfort to the viewer 28 than outputting the target subject emphasis sound from the speaker 158.

Note that, here, although the fixed value is adopted as the predetermined frequency, a variable value that can be changed in response to the instruction received by the reception device 52, 76, or 152 may be adopted as the predetermined frequency.

In a case in which the first generation process and the second generation process are selectively executed depending on the observation direction change frequency, for example, as shown in FIG. 24, the sound generation process executed by the CPU 58 is different from the sound generation process shown in FIG. 21 in that step ST100 is provided instead of step ST54 and step ST102 is provided instead of step ST56.

In step ST100, the adjustment sound information generation unit 58D acquires the observation direction change frequency information from the HMD 34, and then the sound generation process proceeds to step ST102.

In step ST102, the adjustment sound information generation unit 58D determines whether or not the observation direction change frequency indicated by the observation direction change frequency information acquired in step ST100 is less than the predetermined frequency. In step ST102, in a case in which the observation direction change frequency indicated by the observation direction change frequency information acquired in step ST100 is equal to or more than the predetermined frequency, a negative determination is made, and the sound generation process proceeds to step ST58 shown in FIG. 22. In step ST102, in a case in which the observation direction change frequency indicated by the observation direction change frequency information acquired in step ST100 is less than the predetermined frequency, a positive determination is made, and the sound generation process proceeds to step ST64.

With the present configuration, it is possible to reduce the discomfort given to the viewer 28 due to the frequent switching of the target subject emphasis sound, as compared to a case in which the target subject emphasis sound is also switched each time the target subject is frequently switched.

Note that in the example shown in FIG. 24, in a case in which the observation direction change frequency is equal to or more than the predetermined frequency, the sound generation process proceeds to step ST58 shown in FIG. 22, but the technology of the present disclosure is not limited to this. For example, the sound generation process may proceed to step ST58 shown in FIG. 22 in a case in which the observation direction change frequency is equal to or more than the predetermined frequency and the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view.

In addition, in the example shown in FIG. 24, in a case in which the observation direction change frequency is less than the predetermined frequency, the sound generation process proceeds to step ST64, but the technology of the present disclosure is not limited to this. For example, the sound generation process may proceed to step ST64 in a case in which the observation direction change frequency is less than the predetermined frequency and the angle of view indicated by the angle-of-view information is less than the reference angle of view.

Note that, here, although the aspect example has been described in which the target subject emphasis sound information is not generated in a case in which the observation direction change frequency is equal to or more than the predetermined frequency, the technology of the present disclosure is not limited to this. For example, in a case in which the observation direction change frequency is equal to or more than the predetermined frequency, the target subject emphasis sound information may be generated, and the generated target subject emphasis sound information need not be output by the output unit 58E. In this case as well, since the target subject emphasis sound is not output from the speaker 158, it is possible to reduce the discomfort given to the viewer 28 due to the frequent switching of the target subject emphasis sound, as compared to a case in which the target subject emphasis sound is also switched each time the target subject is frequently switched.

In addition, in the embodiment described above, the aspect example has been described in which the target subject emphasis sound information is not generated in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view, but the technology of the present disclosure is not limited to this. For example, in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view, the target subject emphasis sound information may be generated, and the generated target subject emphasis sound information need not be output by the output unit 58E.

In addition, in the embodiment described above, the aspect example has been described in which the integration sound information is generated by executing the second generation process by the adjustment sound information generation unit 58D, but the technology of the present disclosure is not limited to this. For example, the adjustment sound information generation unit 58D may execute the second generation process to generate stepwise emphasis sound information. The stepwise emphasis sound information is information including the integration sound information, intermediate sound information, and the target subject emphasis sound information. The intermediate sound information is information indicating an intermediate sound in which the target sound is emphasized more than in the integration sound and suppressed more than in the target subject emphasis sound. In this case, in a case in which the observation direction change frequency is equal to or more than the predetermined frequency, the output unit 58E outputs the integration sound information, the intermediate sound information, and the target subject emphasis sound information, which are generated by the adjustment sound information generation unit 58D, to the HMD 34 in order of the integration sound information, the intermediate sound information, and the target subject emphasis sound information.
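
One conceivable way of producing the stepwise emphasis sound information is to interpolate the gains between an even mix and the fully emphasized mix, as in the following Python sketch; the number of steps, the gain range, and the sample-array representation are assumptions for illustration only.

```python
def generate_stepwise_emphasis_sound_information(target_samples, peripheral_samples_list,
                                                 steps=3, max_target_gain=2.0,
                                                 min_peripheral_gain=0.5):
    """Produce a sequence of mixes from the integration sound to the target subject emphasis sound.

    Step 0 mixes all sounds with equal gain (integration sound), the final step applies the
    full emphasis, and the steps in between correspond to the intermediate sound."""
    mixes = []
    for k in range(steps):
        t = k / (steps - 1) if steps > 1 else 1.0
        target_gain = 1.0 + t * (max_target_gain - 1.0)
        peripheral_gain = 1.0 + t * (min_peripheral_gain - 1.0)
        mixed = [target_gain * s for s in target_samples]
        for peripheral in peripheral_samples_list:
            for i in range(min(len(mixed), len(peripheral))):
                mixed[i] += peripheral_gain * peripheral[i]
        mixes.append(mixed)
    return mixes  # [integration sound, intermediate sound(s), target subject emphasis sound]
```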

In this case, the sound generation process executed by the CPU 58 (seeFIG. 26) is different from the sound generation process shown in FIG. 22in that step ST150 is provided instead of step ST58 and step ST152 isprovided instead of step ST62. The stepwise emphasis sound informationis generated by the adjustment sound information generation unit 58D instep ST150 shown in FIG. 26, and the stepwise emphasis sound informationgenerated in step ST150 is output to the HMD 34 by the output unit 58Ein step ST152.

With the present configuration, the integration sound information, theintermediate sound information, and the target subject emphasis soundinformation are output to the HMD 34 from the speaker 158 in order ofthe integration sound, the intermediate sound, and the target subjectemphasis sound, and are heard by the viewer 28. With the presentconfiguration, it is possible to reduce the discomfort given to theviewer 28 due to the frequent switching of the target subject emphasissound as compared to a case in which the target subject emphasis soundis also switched as the target subject is frequently switched.

Note that the intermediate sound information may be informationincluding a plurality of pieces of sound information subdivided suchthat the volume is gradually increased in a non-step manner or amulti-step manner.

In addition, in the embodiment described above, the informationindicating the emphasis sound including the emphasis target sound isadopted as the target subject emphasis sound information, but the targetsubject emphasis sound information may be information indicating thesound including the emphasis target sound and not including theperipheral sound. As a result, it is possible to contribute to easylistening of the target sound as compared to a case in which the targetsubject emphasis sound information is the information indicating thesound including the peripheral sound in addition to the emphasis targetsound.

In addition, in the embodiment described above, the HMD 34 has beendescribed as an example, but the technology of the present disclosure isnot limited to this. For example, as shown in FIG. 27, the targetsubject position information may be acquired by the target subjectposition information acquisition unit 58B1 based on the detection resultby the eye tracker 166 provided in the specific HMD 34 among a pluralityof the HMDs 34 and the third position association information. In theexample shown in FIG. 27, the HMD 34 is mounted on each of viewers 28Ato 28Z (hereinafter, in a case in which a distinction is not necessary,the viewers 28A to 28Z are simply referred to as “viewer” withoutreference numerals). The target subject position information acquisitionunit 58B1 acquires the target subject position information based on thedetection result by the eye tracker 166 provided in the HMD 34 mountedon any of the viewers 28A to 28Z and the third position associationinformation. With the present configuration, it is possible to generatethe target subject emphasis sound information corresponding to thetarget subject at which the viewer who mounts the specific HMD 34 amongthe plurality of HMDs 34 directs attention.

In addition, in the embodiment described above, the aspect example has been described in which the plurality of sound collection devices 100 are fixed in the imaging region, but the technology of the present disclosure is not limited to this. For example, as shown in FIG. 28, a sound collection device 300 may be attached to a target subject 47A. The sound collection device 300 comprises a computer 302, a GPS receiver 304, a microphone 306, a communication I/F 308, and a bus line 316. The computer 302 comprises a CPU 310, a storage 312, and a memory 314. Note that in the example shown in FIG. 28, for convenience of illustration, one bus line is shown as the bus line 316, but a data bus, an address bus, a control bus, and the like are included in the bus line 316, similar to the bus lines 64, 94, and 170 described in the embodiment above.

The computer 302 corresponds to the computer 200 shown in FIG. 8. The microphone 306 corresponds to the microphone 207 shown in FIG. 8. The communication I/F 308 corresponds to the communication I/F 218 shown in FIG. 8. The CPU 310 corresponds to the CPU 210 shown in FIG. 8. The storage 312 corresponds to the storage 212 shown in FIG. 8. The memory 314 corresponds to the memory 214 shown in FIG. 8.

The GPS receiver 304 receives radio waves from a plurality of GPS satellites (not shown) depending on the instruction from the CPU 310, and outputs reception result information indicating a reception result to the CPU 310. The CPU 310 calculates GPS information indicating latitude, longitude, and altitude based on the reception result information input from the GPS receiver 304. The CPU 310 performs the wireless communication with the information processing apparatus 12 via the base station 20 to transmit the sound information obtained from the microphone 306 to the information processing apparatus 12 and to transmit the GPS information to the information processing apparatus 12 as the sound collection position specification information. As a result, the position of the target subject 47A in the imaging region, that is, the target subject position is specified by the information processing apparatus 12. Here, the aspect example has been described in which the GPS information is used as the sound collection position specification information, but the technology of the present disclosure is not limited to this, and any information may be used as long as the position of the sound collection device 300 in the imaging region can be specified by the information. In addition, a plurality of the sound collection devices 300 may be attached to the target subject 47A.
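
The following sketch illustrates, under assumptions, how the sound collection device 300 might package the collected sound together with the GPS fix as the sound collection position specification information; the field names and layout are hypothetical and are not the actual transmission format of the embodiment.

    # Hypothetical packaging of the collected sound and the GPS fix; the field
    # names and layout are assumptions, not the embodiment's actual format.
    from dataclasses import dataclass

    @dataclass
    class SoundReport:
        sound_chunk: bytes   # PCM samples from the microphone 306
        latitude: float      # GPS fix calculated by the CPU 310
        longitude: float
        altitude: float

    def to_target_subject_position(report: SoundReport):
        """The apparatus side treats the reported GPS fix as the position of
        the target subject to which the device is attached."""
        return (report.latitude, report.longitude, report.altitude)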

With the present configuration, it is possible to easily obtain the target sound as compared to a case in which the sound collection device 300 is not attached to the target subject 47A.

Note that, here, the aspect example has been described in which the sound collection device 300 is attached to only one target subject 47A, but the technology of the present disclosure is not limited to this. For example, the sound collection device 300 may be attached to each of a plurality of persons (for example, a player and/or a referee in the soccer field 24) who can be the target subject present in the imaging region. With the present configuration, it is possible to easily obtain the target sound even in a case in which the target subject is switched between the plurality of persons, as compared to a case in which the sound collection device 300 is not attached to each of the plurality of persons in the imaging region.

In addition, in the embodiment described above, the aspect example has been described in which the plurality of sound collection devices 100 are fixed in the imaging region, but the plurality of fixed sound collection devices 100 and the sound collection device 300 attached to each of the plurality of persons may be used in combination.

In addition, in the embodiment described above, the aspect example has been described in which the volume of the sound information obtained by the sound collection device 100 is not changed and used by the information processing apparatus 12, but the technology of the present disclosure is not limited to this. For example, the volume may be made different among a plurality of sounds indicated by the plurality of pieces of sound information obtained by the plurality of sound collection devices 100.

In this case, the specifying unit 58C specifies a positional relationship between the target subject position and the plurality of sound collection devices 100 by using the sound collection position specification information acquired by the sound collection device side information acquisition unit 58B2 and the target subject position information acquired by the target subject position information acquisition unit 58B1. Moreover, depending on the positional relationship specified by the specifying unit 58C, the adjustment sound information generation unit 58D controls each piece of sound information such that the sound indicated by the sound information is adjusted to be smaller as the sound collection device 100 that obtained the sound is positioned farther from the target subject position, for example, as shown in FIG. 29. The sound information controlled in this way is used, for example, for generating the target subject emphasis sound information and the integration sound information by the adjustment sound information generation unit 58D. With the present configuration, even in a state in which the target sound and the peripheral sound are mixed, it is possible to contribute to hearing the target sound and the peripheral sound in a distinguishable manner.
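
A minimal sketch of such distance-dependent adjustment is shown below, assuming two-dimensional device and target positions and a linear falloff; the function name attenuate_by_distance and the constant max_distance are illustrative only.

    # Sketch of distance-dependent volume adjustment (assumptions: sounds as
    # NumPy arrays, 2-D positions, linear falloff clipped at zero gain).
    import numpy as np

    def attenuate_by_distance(sounds, device_positions, target_position,
                              max_distance: float = 50.0):
        """Scale each sound down as its device lies farther from the target."""
        adjusted = []
        for sound, position in zip(sounds, device_positions):
            distance = float(np.linalg.norm(np.asarray(position) -
                                            np.asarray(target_position)))
            gain = max(0.0, 1.0 - distance / max_distance)  # linear falloff
            adjusted.append(gain * np.asarray(sound))
        return adjusted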

Note that in the example shown in FIG. 29, the aspect has been described in which the volume of the sound indicated by the sound information is attenuated linearly with respect to the distance from the target subject position to the sound collection device 100, but the technology of the present disclosure is not limited to this, and the volume of the sound indicated by the sound information may be attenuated non-linearly with respect to the distance from the target subject position to the sound collection device 100. In addition, the volume of the sound indicated by the sound information may be attenuated in a stepwise manner. In a case in which the volume is attenuated in a stepwise manner, the interval over which the same volume is maintained may be gradually shortened or lengthened.
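
For illustration, the three attenuation variations mentioned above could be expressed as gain curves such as the following; the specific constants and function names are assumptions.

    # Illustrative gain curves for the linear, non-linear, and stepwise
    # attenuation variations; the constants are arbitrary examples.
    def linear_gain(distance: float, max_distance: float = 50.0) -> float:
        return max(0.0, 1.0 - distance / max_distance)

    def nonlinear_gain(distance: float) -> float:
        return 1.0 / (1.0 + distance * distance)   # falls off faster than linear

    def stepwise_gain(distance: float, step: float = 10.0) -> float:
        levels = [1.0, 0.6, 0.3, 0.1, 0.0]          # one level per distance band
        return levels[min(int(distance // step), len(levels) - 1)]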

In addition, in the embodiment described above, the aspect example has been described in which the first generation process is executed in a case in which the angle of view indicated by the angle-of-view information is less than the reference angle of view, and the second generation process is executed in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view, but the technology of the present disclosure is not limited to this. For example, as shown in FIG. 30, the second generation process may be executed in a case in which the visual field obtained in a case in which the imaging region is observed from the viewpoint position 42 is a visual field that surrounds a preset reference region 24B in the soccer field 24.

On the other hand, for example, as shown in FIG. 31, the first generation process may be executed in a case in which the visual field obtained in a case in which the imaging region is observed from the viewpoint position 42 is within the reference region 24B.

Note that the determination as to whether or not the visual field in a case in which the imaging region is observed from the viewpoint position 42 surrounds the reference region 24B need only be made by the CPU 58 determining whether or not an image showing the whole reference region 24B is included in the virtual viewpoint video 46 generated by the video generation unit 58A.

In addition, for example, as shown in FIG. 32, in a case in which the reference region 24B is not within the visual field from the viewpoint position 42, the second generation process may be executed without executing the first generation process. Note that in the examples shown in FIGS. 30 to 32, a rectangular region is adopted as the reference region 24B, but the shape of the reference region 24B is not limited to this, and may be a region of another shape, such as a circular region or a polygonal region other than a rectangle.
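
A minimal sketch of the selection logic of FIGS. 30 to 32 is shown below, assuming that both the visual field footprint and the reference region 24B are approximated by axis-aligned rectangles; the helper names are hypothetical.

    # Sketch of the process selection of FIGS. 30 to 32 with axis-aligned
    # rectangles standing in for the visual field and the reference region.
    from dataclasses import dataclass

    @dataclass
    class Rect:
        x0: float
        y0: float
        x1: float
        y1: float

    def contains(outer: Rect, inner: Rect) -> bool:
        return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0 and
                outer.x1 >= inner.x1 and outer.y1 >= inner.y1)

    def choose_generation_process(visual_field: Rect, reference_region: Rect) -> str:
        if contains(visual_field, reference_region):
            return "second generation process"   # FIG. 30: region fully visible
        if contains(reference_region, visual_field):
            return "first generation process"    # FIG. 31: field within region
        return "second generation process"       # FIG. 32: region not in field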

In addition, in the embodiment described above, the aspect example has been described in which the CPU 58 of the information processing apparatus 12 executes the video generation process and the sound generation process (hereinafter, in a case in which a distinction is not necessary, referred to as “information processing apparatus side process”), but the technology of the present disclosure is not limited to this, and the information processing apparatus side process may be executed by the terminal device, or may be distributed to and executed by a plurality of devices, such as the smartphone 14 and the HMD 34.

In addition, the HMD 34 may be caused to execute the information processing apparatus side process. In this case, for example, as shown in FIG. 33, the information processing apparatus program is stored in the storage 162 of the HMD 34. The CPU 160 executes the video generation process by being operated as the video generation unit 58A and the acquisition unit 58B according to the video generation program 60A. In addition, the CPU 160 executes the sound generation process by being operated as the acquisition unit 58B, the specifying unit 58C, the adjustment sound information generation unit 58D, and the output unit 58E according to the sound generation program 60B.

In addition, in the embodiment described above, the HMD 34 has been described as an example, but the technology of the present disclosure is not limited to this, and the HMD 34 can be substituted with various devices equipped with an arithmetic device, such as a smartphone, a tablet terminal, a head-up display, or a personal computer.

In addition, in the embodiment described above, the soccer stadium 22 has been described as an example, but it is merely an example, and any place, such as a baseball stadium, a rugby stadium, a curling stadium, an athletics stadium, a swimming pool, a concert hall, an outdoor music hall, or a theater venue, may be adopted as long as the plurality of imaging apparatuses and the plurality of sound collection devices 100 can be installed.

In addition, in the embodiment described above, the wireless communication method using the base station 20 has been described as an example, but it is merely an example, and the technology of the present disclosure is established even with a wired communication method using a cable.

In addition, in the embodiment described above, the unmanned aerial vehicle 27 has been described as an example, but the technology of the present disclosure is not limited to this, and the imaging region may be imaged by the imaging apparatus 18 suspended by a wire (for example, a self-propelled imaging apparatus that can move along the wire).

In addition, in the above description, the computers 50, 70, 100, 150, 200, and 302 have been described as examples, but the technology of the present disclosure is not limited to these. For example, instead of the computers 50, 70, 100, 150, 200, and/or 302, a device including an ASIC, an FPGA, and/or a PLD may be applied. In addition, instead of the computers 50, 70, 100, 150, 200, and/or 302, a combination of a hardware configuration and a software configuration may be used.

In addition, in the embodiment described above, the information processing apparatus program is stored in the storage 60, but the technology of the present disclosure is not limited to this, and as shown in FIG. 34, for example, the information processing apparatus program may be stored in any portable storage medium 400, such as an SSD or a USB memory, which is a non-transitory storage medium. In this case, the information processing apparatus program stored in the storage medium 400 is installed in the computer 50, and the CPU 58 executes the information processing apparatus side process according to the information processing apparatus program.

In addition, the information processing apparatus program may be stored in a storage unit of another computer or a server device connected to the computer 50 via a communication network (not shown), and the information processing apparatus program may be downloaded to the information processing apparatus 12 in response to the request of the information processing apparatus 12. In this case, the information processing apparatus side process based on the downloaded information processing apparatus program is executed by the CPU 58 of the computer 50.

In addition, in the embodiment described above, the CPU 58 has been described as an example, but the technology of the present disclosure is not limited to this, and a GPU may be adopted. In addition, a plurality of CPUs may be adopted instead of the CPU 58. That is, the information processing apparatus side process may be executed by one processor or a plurality of physically separated processors. In addition, instead of the CPUs 88, 160, 210, and/or 310, a GPU may be adopted, a plurality of CPUs may be adopted, or one processor or a plurality of physically separated processors may be adopted to execute various processes.

The following various processors can be used as a hardware resource for executing the information processing apparatus side process. Examples of the processor include a CPU, which is a general-purpose processor that executes software, that is, the program, to function as the hardware resource for executing the information processing apparatus side process, as described above. In addition, another example of the processor includes a dedicated electric circuit, which is a processor having a circuit configuration specially designed for executing a specific process, such as an FPGA, a PLD, or an ASIC. A memory is also built in or connected to each processor, and each processor executes the information processing apparatus side process by using the memory.

The hardware resource for executing the information processing apparatus side process may be configured by one of the various processors, or may be a combination of two or more processors of the same type or different types (for example, a combination of a plurality of FPGAs or a combination of a CPU and an FPGA). In addition, the hardware resource for executing the information processing apparatus side process may be one processor.

As an example of configuring the hardware resource with one processor, first, as represented by a computer such as a client computer or a server, there is a form in which one processor is configured by a combination of one or more CPUs and software, and the processor functions as the hardware resource for executing the information processing apparatus side process. Second, as represented by a system-on-chip (SoC), there is a form in which a processor that realizes, with one IC chip, the functions of the whole system including a plurality of the hardware resources for executing the information processing apparatus side process is used. In this way, the information processing apparatus side process is realized by using one or more of the various processors described above as the hardware resource.

Further, as the hardware structure of these various processors, more specifically, an electric circuit in which circuit elements such as semiconductor elements are combined can be used.

In addition, the information processing apparatus side process described above is merely an example. Therefore, it is needless to say that unnecessary steps may be deleted, new steps may be added, or the process order may be changed within a range that does not deviate from the gist.

The contents described and shown above are the detailed description of the parts according to the technology of the present disclosure, and are merely examples of the technology of the present disclosure. For example, the description of the configuration, the function, the action, and the effect above is the description of examples of the configuration, the function, the action, and the effect of the parts according to the technology of the present disclosure. Accordingly, it is needless to say that unnecessary parts may be deleted, new elements may be added, or replacements may be made with respect to the contents described and shown above within a range that does not deviate from the gist of the technology of the present disclosure. In addition, in order to avoid complications and to facilitate understanding of the parts according to the technology of the present disclosure, in the contents described and shown above, the description of common technical knowledge and the like that does not particularly require description for enabling the implementation of the technology of the present disclosure is omitted.

In the present specification, “A and/or B” is synonymous with “at least one of A or B”. That is, “A and/or B” means that it may be only A, only B, or a combination of A and B. In addition, in the present specification, in a case in which three or more matters are associated and expressed by “and/or”, the same concept as “A and/or B” is applied.

All of the documents, patent applications, and technical standards described in the present specification are incorporated in the present specification by reference to the same extent as in a case in which each individual document, patent application, and technical standard is specifically and individually noted to be incorporated by reference.

Regarding the embodiment described above, the following supplementary note will be further disclosed.

(Supplementary Note 1)

An information processing apparatus including a processor, and a memory built in or connected to the processor,

in which the processor acquires a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices scattered in an imaging region, a sound collection device position information indicating a position of each of the plurality of sound collection devices in the imaging region, and a target subject position information indicating a position of a target subject in the imaging region,

specifies a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information, and

generates target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information in a case in which a virtual viewpoint video is generated by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.

What is claimed is:
 1. An information processing apparatus comprising: a processor; and a memory built in or connected to the processor, wherein the processor acquires a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices, a sound collection device position information indicating a position of each of the plurality of sound collection devices, and a target subject position information indicating a position of a target subject in an imaging region, specifies a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information, and generates target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information in a case in which a virtual viewpoint video is generated, based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information, by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.
 2. The information processing apparatus according to claim 1, wherein the processor selectively executes a first generation process of generating the target subject emphasis sound information, and a second generation process of generating, based on the acquired sound information, integration sound information indicating an integration sound obtained by integrating a plurality of the sounds obtained by the plurality of sound collection devices.
 3. The information processing apparatus according to claim 2, wherein the processor executes the first generation process in a case in which the angle of view indicated by the angle-of-view information is less than a reference angle of view, and executes the second generation process in a case in which the angle of view indicated by the angle-of-view information is equal to or more than the reference angle of view.
 4. The information processing apparatus according to claim 1, wherein indication information for indicating a position of a target subject image showing the target subject in an imaging region image showing the imaging region is received by a reception device in a state in which the imaging region image is displayed by a display device, and the processor acquires the target subject position information based on correspondence information indicating a correspondence between a position in the imaging region and a position in the imaging region image showing the imaging region, and the indication information received by the reception device.
 5. The information processing apparatus according to claim 1, wherein an observation direction of a person who observes an imaging region image showing the imaging region is detected by a detector in a state in which the imaging region image is displayed by a display device, and the processor acquires the target subject position information based on correspondence information indicating a correspondence between a position in the imaging region and a position in the imaging region image showing the imaging region, and a detection result by the detector.
 6. The information processing apparatus according to claim 5, wherein the detector includes an imaging element, and detects a visual line direction of the person as the observation direction based on an eye image obtained by imaging eyes of the person by the imaging element.
 7. The information processing apparatus according to claim 5, wherein the display device is a head mounted display mounted on the person, and the detector is provided on the head mounted display.
 8. The information processing apparatus according to claim 7, wherein a plurality of the head mounted displays are present, and the processor acquires the target subject position information based on the detection result by the detector provided on a specific head mounted display among the plurality of head mounted displays, and the correspondence information.
 9. The information processing apparatus according to claim 5, wherein the processor does not generate the target subject emphasis sound information in a case in which a frequency at which the observation direction changes per unit time is equal to or more than a predetermined frequency.
 10. The information processing apparatus according to claim 5, wherein the processor is able to output the generated target subject emphasis sound information, and does not output the generated target subject emphasis sound information in a case in which a frequency at which the observation direction changes per unit time is equal to or more than a predetermined frequency.
 11. The information processing apparatus according to claim 5, wherein the processor generates comprehensive sound information indicating a comprehensive sound obtained by integrating a plurality of the sounds obtained by the plurality of sound collection devices, and intermediate sound information indicating an intermediate sound in which the target sound is emphasized more than the comprehensive sound and suppressed more than the target subject emphasis sound, and outputs the generated comprehensive sound information, the generated intermediate sound information, and the generated target subject emphasis sound information in order of the comprehensive sound information, the intermediate sound information, and the target subject emphasis sound information in a case in which a frequency at which the observation direction changes per unit time is equal to or more than a predetermined frequency.
 12. The information processing apparatus according to claim 1, wherein the target subject emphasis sound information is information indicating a sound including the target subject emphasis sound and not including the sound emitted from the different region.
 13. The information processing apparatus according to claim 1, wherein the processor specifies a positional relationship between the position of the target subject and the plurality of sound collection devices by using the acquired sound collection device position information and the acquired target subject position information, and the sound indicated by each of the plurality of pieces of sound information is a sound adjusted to be smaller as the sound is positioned farther from the position of the target subject depending on the positional relationship specified by the processor.
 14. The information processing apparatus according to claim 1, wherein a virtual viewpoint target subject image showing the target subject included in the virtual viewpoint video is an image that is in focus more than images in a periphery of the virtual viewpoint target subject image in the virtual viewpoint video.
 15. The information processing apparatus according to claim 1, wherein the sound collection device position information is information indicating the position of the sound collection device fixed in the imaging region.
 16. The information processing apparatus according to claim 1, wherein at least one of the plurality of sound collection devices is attached to the target subject.
 17. The information processing apparatus according to claim 1, wherein the plurality of sound collection devices are attached to a plurality of objects including the target subject in the imaging region.
 18. An information processing method comprising: acquiring a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices, a sound collection device position information indicating a position of each of the plurality of sound collection devices, and a target subject position information indicating a position of a target subject in an imaging region; specifying a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information; and generating target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information in a case in which a virtual viewpoint video is generated, based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information, by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.
 19. A non-transitory computer-readable storage medium storing a program for causing a computer to execute a process comprising: acquiring a plurality of pieces of sound information indicating sounds obtained by a plurality of sound collection devices, a sound collection device position information indicating a position of each of the plurality of sound collection devices, and a target subject position information indicating a position of a target subject in an imaging region; specifying a target sound of a region corresponding to the position of the target subject from the plurality of pieces of sound information based on the acquired sound collection device position information and the acquired target subject position information; and generating target subject emphasis sound information indicating a sound including a target subject emphasis sound in which the specified target sound is emphasized more than a sound emitted from a region different from the region corresponding to the position of the target subject indicated by the acquired target subject position information in a case in which a virtual viewpoint video is generated, based on viewpoint position information indicating a position of a virtual viewpoint with respect to the imaging region, visual line direction information indicating a virtual visual line direction with respect to the imaging region, angle-of-view information indicating an angle of view with respect to the imaging region, and the target subject position information, by using a plurality of images obtained by imaging the imaging region by a plurality of imaging apparatuses in a plurality of directions.